Request fails. You retry immediately. Fails again. Retry immediately. Server gets hammered. Everything crashes.

Naive retries make outages worse.

The Problem with Immediate Retries#

Service goes down for 2 seconds. 10,000 clients hit timeout. All retry immediately. Service comes back up, gets hit with 10,000 simultaneous requests. Dies again.

This is a retry storm. Your retries prevent the service from recovering.

Exponential Backoff#

Wait longer between each retry. First retry after 1 second. Second after 2 seconds. Third after 4 seconds. Fourth after 8 seconds.

public class ExponentialBackoffRetry {
    private static final int MAX_RETRIES = 5;
    private static final long BASE_DELAY_MS = 1000;
    // httpClient: any HTTP client whose send(url) throws on failure
    
    public Response makeRequest(String url) throws Exception {
        Exception lastException = null;
        
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return httpClient.send(url);
            } catch (Exception e) {
                lastException = e;
                
                if (attempt < MAX_RETRIES - 1) {
                    long delay = BASE_DELAY_MS * (1L << attempt);  // 2^attempt
                    Thread.sleep(delay);
                }
            }
        }
        
        throw lastException;
    }
}

Delays between the five attempts: 1s, 2s, 4s, 8s. Gives the server breathing room.

Problem: all clients still retry at the same time. The retry waves are just spread further apart.

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant S as Server
    C->>S: Request (attempt 1)
    S-->>C: ❌ 503 Error
    Note over C: Wait 1s
    C->>S: Request (attempt 2)
    S-->>C: ❌ 503 Error
    Note over C: Wait 2s
    C->>S: Request (attempt 3)
    S-->>C: ❌ 503 Error
    Note over C: Wait 4s
    C->>S: Request (attempt 4)
    S-->>C: ✓ 200 OK
```

Exponential backoff increases delay between retries.

Adding Jitter#

Randomize the delay. Instead of exactly 4 seconds, wait somewhere between 2 and 6 seconds. This spreads retry attempts out over time.

public class ExponentialBackoffWithJitter {
    private static final int MAX_RETRIES = 5;
    private static final long BASE_DELAY_MS = 1000;
    private static final Random random = new Random();
    
    public Response makeRequest(String url) throws Exception {
        Exception lastException = null;
        
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return httpClient.send(url);
            } catch (Exception e) {
                lastException = e;
                
                if (attempt < MAX_RETRIES - 1) {
                    long baseDelay = BASE_DELAY_MS * (1L << attempt);
                    
                    // Full jitter: random between 0 and baseDelay
                    long delay = (long) (random.nextDouble() * baseDelay);
                    
                    Thread.sleep(delay);
                }
            }
        }
        
        throw lastException;
    }
}

Jitter strategies:

Full jitter: Random between 0 and max delay. Most spreading, but sometimes very short waits.

Equal jitter: Base delay/2 + random(0, base delay/2). Guarantees minimum wait, still spreads.

Decorrelated jitter: Next delay based on previous delay, not just attempt number. AWS recommendation.

// Decorrelated jitter (AWS style): next delay depends on the previous delay
long previousDelay = BASE_DELAY_MS;

for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
        return httpClient.send(url);
    } catch (Exception e) {
        if (attempt < MAX_RETRIES - 1) {
            long minDelay = BASE_DELAY_MS;
            long maxDelay = previousDelay * 3;
            
            previousDelay = minDelay + (long) (random.nextDouble() * (maxDelay - minDelay));
            // In practice, also cap previousDelay (e.g. at 20s) so it can't grow unbounded
            Thread.sleep(previousDelay);
        }
    }
}
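For comparison, the three strategies can be sketched side by side. A minimal sketch; `JitterDemo` and its method names are mine for illustration, not from any library:

```java
import java.util.Random;

public class JitterDemo {
    private static final Random RANDOM = new Random();
    private static final long BASE_DELAY_MS = 1000;

    // Full jitter: anywhere in [0, cap)
    static long fullJitter(long capMs) {
        return (long) (RANDOM.nextDouble() * capMs);
    }

    // Equal jitter: half the cap guaranteed, the rest random
    static long equalJitter(long capMs) {
        return capMs / 2 + (long) (RANDOM.nextDouble() * (capMs / 2));
    }

    // Decorrelated jitter: based on the previous delay, not the attempt number
    static long decorrelatedJitter(long previousMs) {
        long max = previousMs * 3;
        return BASE_DELAY_MS + (long) (RANDOM.nextDouble() * (max - BASE_DELAY_MS));
    }

    public static void main(String[] args) {
        long cap = BASE_DELAY_MS * 4;  // e.g. third attempt: cap = 4s
        System.out.println("full:         " + fullJitter(cap) + " ms");
        System.out.println("equal:        " + equalJitter(cap) + " ms");
        System.out.println("decorrelated: " + decorrelatedJitter(BASE_DELAY_MS) + " ms");
    }
}
```

Full jitter spreads the most; equal jitter trades some spread for a guaranteed floor; decorrelated jitter ignores the attempt counter entirely and walks from the previous delay.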

When to Give Up#

Can’t retry forever. Set limits.

Max retries: 3-5 attempts is typical. More than 5? It's probably not a transient failure.

Max total time: Don’t retry for more than 30 seconds total. The user is still waiting.

Circuit breaker integration: If circuit is open, don’t retry. Fail fast.
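A sketch of that fail-fast check, assuming a hypothetical CircuitBreaker interface rather than any specific library:

```java
import java.util.function.Supplier;

// Hypothetical circuit-breaker interface; real libraries (e.g. Resilience4j)
// expose a similar state check.
interface CircuitBreaker {
    boolean isOpen();
}

public class FailFastRetry {
    // If the circuit is open, don't even attempt the call: fail fast
    // instead of burning retries against a service known to be down.
    public static <T> T callWithRetry(CircuitBreaker breaker, Supplier<T> call, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (breaker.isOpen()) {
                throw new IllegalStateException("circuit open, failing fast");
            }
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                // a real client would back off with jitter here
            }
        }
        throw last;
    }
}
```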

public class RetryWithTimeout {
    private static final int MAX_RETRIES = 5;
    private static final long BASE_DELAY_MS = 1000;
    private static final long MAX_TOTAL_TIME_MS = 30_000;
    
    public Response makeRequest(String url) throws Exception {
        long startTime = System.currentTimeMillis();
        Exception lastException = null;
        
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            // Check total time
            if (System.currentTimeMillis() - startTime > MAX_TOTAL_TIME_MS) {
                throw new TimeoutException("Retry timeout exceeded");
            }
            
            try {
                return httpClient.send(url);
            } catch (Exception e) {
                lastException = e;
                
                if (attempt < MAX_RETRIES - 1) {
                    long delay = calculateDelay(attempt);
                    Thread.sleep(delay);
                }
            }
        }
        
        throw lastException;
    }
    
    private long calculateDelay(int attempt) {
        // exponential backoff with full jitter
        long cap = BASE_DELAY_MS * (1L << attempt);
        return (long) (Math.random() * cap);
    }
}

Retry Based on Error Type#

Not all failures should retry.

Retry:

  • 503 Service Unavailable (temporary)
  • 504 Gateway Timeout
  • 429 Too Many Requests (back off, respect Retry-After)
  • Network errors (connection refused, timeout)

Don’t retry:

  • 400 Bad Request (client error, won’t succeed)
  • 401 Unauthorized (fix auth, don’t hammer)
  • 404 Not Found (resource doesn’t exist)

public boolean shouldRetry(Exception e) {
    if (e instanceof HttpException) {
        int status = ((HttpException) e).getStatusCode();
        return status == 503 || status == 504 || status == 429;
    }
    
    if (e instanceof SocketTimeoutException || e instanceof ConnectException) {
        return true;  // Network issues
    }
    
    return false;  // Don't retry client errors
}
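Wiring that classification into the backoff loop might look like this: a generic sketch where `SelectiveRetry` and the Supplier/Predicate parameters are my illustration, not code from the post:

```java
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class SelectiveRetry {
    private static final Random RANDOM = new Random();

    // Retries the call with full-jitter exponential backoff, but only for
    // failures the caller classifies as transient via shouldRetry.
    public static <T> T retry(Supplier<T> call,
                              Predicate<RuntimeException> shouldRetry,
                              int maxRetries, long baseDelayMs) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (!shouldRetry.test(e) || attempt == maxRetries - 1) {
                    throw e;  // permanent failure, or out of attempts
                }
                long cap = baseDelayMs * (1L << attempt);
                try {
                    Thread.sleep((long) (RANDOM.nextDouble() * cap));  // full jitter
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
        throw last;
    }
}
```

A 400-style failure is thrown straight back to the caller on the first attempt; only transient failures burn retry budget.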

What I’ve Seen#

A message processing system consuming from RabbitMQ hit transient database connection failures. Without retries, messages failed permanently. With naive immediate retries, a retry storm overwhelmed the database during recovery.

Added exponential backoff with jitter: max 5 retries, 30-second total timeout. Transient failures recovered. The database got breathing room during incidents. Message failure rate dropped by 90%.

The key: retries are necessary, but naive retries are dangerous. Add backoff and jitter.

How does your retry logic work?