Service A calls Service B. Service B is slow. How long should A wait?

Too short: A gives up on requests that would have succeeded. Users see errors. Retries pile on. B gets hammered.

Too long: A’s threads block waiting. The connection pool drains. A becomes slow. A’s callers time out. The slowness spreads.

There’s no safe default. And yet I’ve seen codebases with no timeouts at all.

The No-Timeout Trap#

No timeout means infinite wait. One stuck request holds a thread forever.

// Dangerous: no timeout — send() can block this thread indefinitely
HttpResponse<String> response = httpClient.send(request, BodyHandlers.ofString());

I’ve seen services brought down by a single dependency that went unresponsive. All threads blocked. Health checks passed (the app was alive, just stuck). Load balancer kept sending traffic. Death by hanging threads.

Always set a timeout. Always.

Too Short#

Aggressive timeouts feel safe. “Fail fast!” But consider:

Service B’s p99 latency is 800ms. You set a 500ms timeout. At least 1% of requests fail — everything slower than 500ms, not just the slowest 1%. Under load, latency increases; now 10% fail. Retries kick in, and B gets 10% more traffic. Latency climbs further. 30% fail.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000'}}}%%
graph TD
    A[Timeout too short] --> B[Some requests fail]
    B --> C[Retries increase load]
    C --> D[Latency increases]
    D --> E[More requests fail]
    E --> C
    style A fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

Short timeouts plus retries can turn a slow service into a dead service. The timeout “protected” you right into an outage.
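One way to keep retries from amplifying an outage is to cap the retry budget and add jittered backoff. A minimal sketch — `callServiceB` is a hypothetical stand-in for the real call, and the one-retry budget is an illustrative choice, not a recommendation:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class CappedRetry {
    static int attempts = 0;

    // Hypothetical stand-in for the real call to Service B; here it always fails.
    static String callServiceB() {
        attempts++;
        throw new RuntimeException("timeout");
    }

    static String callWithRetry(int maxRetries, Duration baseBackoff) throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return callServiceB();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxRetries) break; // budget spent: stop, don't hammer B
                // Full jitter: sleep a random duration within an exponentially growing window.
                long capMillis = baseBackoff.toMillis() << attempt;
                Thread.sleep(ThreadLocalRandom.current().nextLong(capMillis + 1));
            }
        }
        throw last;
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            callWithRetry(1, Duration.ofMillis(50)); // at most 2 requests total
        } catch (RuntimeException e) {
            System.out.println("gave up after retry budget: " + e.getMessage());
        }
    }
}
```

With a budget of one retry, a 10% failure rate adds at most 10% extra traffic, instead of the open-ended amplification of unlimited retries.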

Too Long#

Long timeouts seem generous. “Give it time to recover!” But resources are finite.

You have 200 threads serving roughly 30 requests per second. Each request to slow Service B holds a thread for 30 seconds. After about 7 seconds, all 200 threads are waiting. New requests queue. Your service is now slow too.

Your callers have timeouts. They give up. But your threads are still waiting. You’re doing work that no one wants anymore.
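A bulkhead limits how many threads can wait on B at once, so a hung dependency exhausts a small quota instead of the whole pool. A sketch using a plain `java.util.concurrent.Semaphore` — the class name and the limit of 20 are illustrative assumptions:

```java
import java.time.Duration;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the call only if a permit frees up within the wait budget;
    // otherwise fails fast instead of parking yet another thread.
    public <T> T call(Supplier<T> slowCall, Duration maxWait) throws InterruptedException {
        if (!permits.tryAcquire(maxWait.toMillis(), TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("bulkhead full: rejecting instead of blocking");
        }
        try {
            return slowCall.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Bulkhead toServiceB = new Bulkhead(20);
        String result = toServiceB.call(() -> "response from B", Duration.ofMillis(100));
        System.out.println(result);
    }
}
```

With 20 permits, even if B hangs completely, 180 of the 200 threads stay free to serve traffic that doesn’t touch B.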

Cascading Timeouts#

The real complexity: chains.

User → API Gateway → Service A → Service B → Database

Each hop has a timeout. If the gateway’s timeout is 10 seconds, but A’s timeout to B is 30 seconds, A might still be waiting when the gateway has already given up and told the user “error.”

A finishes, returns a response, but the connection is closed. Wasted work.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000'}}}%%
sequenceDiagram
    autonumber
    participant U as User
    participant G as Gateway (10s)
    participant A as Service A (30s)
    participant B as Service B
    U->>G: Request
    G->>A: Request
    A->>B: Request
    Note over B: Slow... 15 seconds
    Note over G: 10s timeout fires
    G-->>U: 504 Gateway Timeout
    B-->>A: Response (too late)
    A-->>G: Response (connection closed)
    Note over A: Wasted work

Timeouts should decrease as you go deeper. Gateway: 10s. A to B: 8s. B to DB: 5s. Leave room for each layer.

Timeout Budgets#

A smarter approach: propagate the remaining time.

Gateway has 10 seconds. It spends 1 second on processing, passes “9 seconds remaining” to Service A. A spends 500ms, passes “8.5 seconds remaining” to B. Everyone knows how much time is left.

// Pseudocode: the deadline arrives with the request (e.g. in a header)
Duration remaining = Duration.between(Instant.now(), deadline);
if (remaining.isNegative() || remaining.isZero()) {
    throw new DeadlineExceededException();
}
serviceB.call(request, remaining);

gRPC does this with deadlines. The deadline propagates through the call chain automatically. If you’re past the deadline, don’t even start the work.
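Outside gRPC, the same pattern can be sketched in plain Java: compute a deadline once at the edge, then derive each hop’s timeout from what’s left. The class and exception names here are illustrative, not a standard API:

```java
import java.time.Duration;
import java.time.Instant;

public class DeadlineBudget {
    // Thrown when the budget is already spent; name is illustrative.
    static class DeadlineExceededException extends RuntimeException {}

    // Returns the time left before the deadline, or throws if it has passed.
    static Duration remaining(Instant deadline) {
        Duration left = Duration.between(Instant.now(), deadline);
        if (left.isNegative() || left.isZero()) {
            throw new DeadlineExceededException();
        }
        return left;
    }

    public static void main(String[] args) {
        // The edge sets the budget once; every hop checks before doing work.
        Instant deadline = Instant.now().plusSeconds(10);
        Duration forServiceB = remaining(deadline);
        System.out.println("pass to B: " + forServiceB.toMillis() + "ms left");
    }
}
```

Each service calls `remaining` before starting work and uses the result as its downstream timeout, so the budget shrinks naturally as the chain deepens.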

What to Set#

No universal answer, but guidelines:

Connection timeout: Short. 1-5 seconds. If you can’t connect in 5 seconds, the server is probably down.

Read timeout: Longer. Based on expected response time. p99 latency + buffer. If p99 is 500ms, maybe 2 seconds.

Total timeout: The user-facing limit. How long will a human wait? 10 seconds? 30 seconds? Work backwards from that.

HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2))
    .build();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://api.example.com"))
    .timeout(Duration.ofSeconds(5))  // bounds the whole exchange; throws HttpTimeoutException
    .build();

What I’m Learning#

Timeouts taught me that distributed systems fail in slow motion. A crash is obvious. A slow service is insidious. It spreads through the system like poison.

The mental shift: timeouts aren’t about handling failure. They’re about containing it. A good timeout says “I’d rather fail fast and free resources than wait for something that might never come.”

When I see a service with no timeouts, I now see a service waiting to freeze. When I see aggressive timeouts without circuit breakers, I see a retry storm waiting to happen.

The right timeout depends on your dependencies, your SLOs, and your users’ patience. There’s no formula. Just measurement and judgment.

What’s the worst timeout-related outage you’ve experienced?