Your dashboard says average latency is 50ms. Everything looks healthy. But 1% of your users are waiting 3 seconds. Some are timing out entirely.

Averages lie. P99 tells the truth.

Why Averages Hide Problems

100 requests. 99 complete in 40ms. One takes 5 seconds. Average: 89.6ms. Looks fine. That one user? Furious.
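The arithmetic is worth sanity-checking, because it's the whole point: one 5-second outlier barely moves the mean.

```java
public class AverageHides {
    // Mean latency of 99 fast requests (40ms) and one slow one (5000ms).
    static double mean() {
        return (99 * 40 + 5000) / 100.0;  // 8960ms over 100 requests = 89.6ms
    }

    public static void main(String[] args) {
        System.out.println(AverageHides.mean() + "ms");
    }
}
```

A dashboard plotting that mean shows ~90ms and never hints that anyone waited 5 seconds.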

Now add fan-out. Your API calls 5 backend services in parallel. Each service has a 1% chance of being slow. The chance that at least one is slow: ~5%. At 10 services, it’s ~10%. Your API’s P99 is worse than any individual service’s P99.

This is the tail-at-scale problem. The more services you call, the more likely you’ll hit someone’s bad tail.
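The compounding is easy to compute. A small sketch, assuming each backend call is independently slow with the same probability:

```java
public class TailProbability {
    // Chance that at least one of n parallel backend calls lands in its
    // slow tail, if each call is independently slow with probability p.
    static double pAnySlow(int n, double p) {
        return 1.0 - Math.pow(1.0 - p, n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 5, 10, 100}) {
            System.out.printf("%3d services -> %.1f%% of requests hit a tail%n",
                    n, 100 * pAnySlow(n, 0.01));
        }
    }
}
```

At 100 fan-out calls with a 1% per-service tail, roughly 63% of requests hit at least one slow backend: every service's P99 has become your median.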

graph TD
    A[API Request] --> B[Service A: 40ms]
    A --> C[Service B: 40ms]
    A --> D[Service C: 40ms]
    A --> E[Service D: 3000ms P99 hit]
    E --> F[Total: 3000ms]

What Causes Tail Latency

It’s rarely one thing. Garbage collection pauses, noisy neighbors on shared hardware, lock contention, cold caches, background compaction in your database. The request that hits GC on Service A, a cold cache on Service B, and compaction on Service C gets the worst of everything.

At Oracle, our P50 was 30ms but P99 was 2+ seconds. Took us weeks to figure out the cause: MySQL query plan flips. Under load, the optimizer would occasionally pick a table scan over an index. Same query, same data, wildly different performance.

Hedged Requests

The simplest tail latency fix: send the same request to two backends. Take whichever responds first. You burn extra capacity but your P99 drops dramatically.

CompletableFuture<Response> primary = callService(request);

// Fire the hedge only after one median latency has passed; if the primary
// already finished, reuse it instead of issuing a duplicate call.
Executor afterP50 = CompletableFuture.delayedExecutor(p50Latency, TimeUnit.MILLISECONDS);
CompletableFuture<Response> hedge = CompletableFuture
        .supplyAsync(() -> primary.isDone() ? primary : callService(request), afterP50)
        .thenCompose(Function.identity());

// Whichever copy responds first wins.
primary.applyToEither(hedge, Function.identity()).thenAccept(this::respond);

The trick: don’t hedge immediately. Wait until the P50 latency passes. If the primary hasn’t responded by then, it’s probably in the tail. Only then send the hedge. This keeps extra load under control.

Google does this extensively. Their paper “The Tail at Scale” showed hedged requests can reduce P99 by 50%+ with only 2-5% extra load.

Deadline Propagation

If the caller already gave up, don’t keep working. Propagate timeouts as deadlines through the call chain. gRPC does this natively. For REST, pass a deadline header and check it before expensive operations.

if (Instant.now().isAfter(deadline)) {
    throw new DeadlineExceededException();
}

No point computing a response nobody’s waiting for.
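For REST there's no standard, so you pick a convention. A minimal sketch of carrying a deadline across hops; the `X-Request-Deadline-Millis` header name and the absolute epoch-millis encoding are assumptions, not anything gRPC or HTTP defines:

```java
import java.time.Duration;
import java.time.Instant;

public class DeadlineContext {
    // Hypothetical header name: callers send an absolute deadline as epoch millis.
    static final String DEADLINE_HEADER = "X-Request-Deadline-Millis";

    private final Instant deadline;

    DeadlineContext(Instant deadline) {
        this.deadline = deadline;
    }

    // Parse the deadline a caller sent.
    static DeadlineContext fromHeader(String headerValue) {
        return new DeadlineContext(Instant.ofEpochMilli(Long.parseLong(headerValue)));
    }

    // Budget left for our own work plus any downstream calls.
    Duration remaining() {
        return Duration.between(Instant.now(), deadline);
    }

    // Check this before every expensive step, not just once at entry.
    boolean expired() {
        return !remaining().isPositive();
    }

    // Forward the same absolute deadline to the next hop.
    String toHeader() {
        return Long.toString(deadline.toEpochMilli());
    }
}
```

Passing the absolute deadline (rather than a relative timeout) means clock-skewed hops can misjudge it, but it survives any number of intermediaries without each one re-deriving a budget.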

What I’m Learning

P99 is where your system’s weaknesses compound. Every dependency, every shared resource, every background job is a potential tail latency contributor. Averages smooth this out. P99 exposes it.

The shift for me: stop optimizing average latency. Start measuring and reducing tail latency. That’s where user pain actually lives.

What percentile do you alert on? P95? P99? P99.9?