All tests pass. 100% of health checks green. Monitoring looks beautiful. Then a single Redis node goes down, and your checkout flow returns 500s for 20 minutes. Your circuit breaker was configured but never actually triggered in production. It had a bug. You never knew because you never broke Redis on purpose.

Chaos engineering is the practice of deliberately injecting failures to find these gaps before your users do.

The Steady-State Hypothesis

Before breaking anything, define what “normal” looks like. Request success rate above 99.5%. Latency P99 under 200ms. No error spikes. This is your steady-state hypothesis.

Then inject a fault. Kill a service instance. Add 500ms latency to database calls. Block network traffic between two services. Observe: does the system stay within steady state?

If yes, your resilience patterns work. If no, you found a gap before production did.

graph TD
    D["Define Steady State (P99 < 200ms, errors < 0.5%)"] --> I["Inject Fault"]
    I --> O["Observe Metrics"]
    O --> C{Within steady state?}
    C -->|Yes| P["Resilience works"]
    C -->|No| F["Found a gap: fix it"]
    F --> D
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style I fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style O fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style P fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
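The loop above fits in a few lines of code. Here is a minimal sketch in Java, with the thresholds taken from the hypothesis and the "observed" metrics hard-coded for illustration (in a real experiment they would come from your monitoring system):

```java
public class ChaosExperiment {
    // Steady-state thresholds from the hypothesis
    static final double MAX_ERROR_RATE = 0.005; // errors < 0.5%
    static final long MAX_P99_MILLIS = 200;     // P99 < 200ms

    static boolean withinSteadyState(double errorRate, long p99Millis) {
        return errorRate < MAX_ERROR_RATE && p99Millis < MAX_P99_MILLIS;
    }

    public static void main(String[] args) {
        // 1. Confirm the baseline before injecting anything
        System.out.println("baseline ok: " + withinSteadyState(0.001, 120));

        // 2. Inject a fault (kill an instance, add latency), then re-observe.
        //    These post-fault numbers are made up for the example.
        boolean ok = withinSteadyState(0.02, 450);
        System.out.println(ok ? "Resilience works" : "Found a gap: fix it");
    }
}
```

The point is that the hypothesis is a boolean check against metrics you already collect, not a new measurement system.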

Types of Faults to Inject

Process failure: Kill a service instance. Does the load balancer route around it? How long until health checks detect it?

Network latency: Add 500ms delay between service A and service B. Do timeouts fire correctly? Does the circuit breaker open?

Network partition: Block traffic between two services entirely. Do retries storm? Does backpressure kick in?

Resource exhaustion: Fill the disk. Exhaust the connection pool. Leak memory. These are the failures that creep up slowly and are hardest to test.
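One way to rehearse pool exhaustion before it creeps up on you: grab every connection on purpose and verify that callers fail fast instead of hanging. A minimal sketch, using a `Semaphore` as a stand-in for a connection pool (the pool size and timeout are made up for the example):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDrill {
    public static void main(String[] args) throws InterruptedException {
        Semaphore pool = new Semaphore(5); // stand-in for a pool of 5 connections

        // Chaos: hold every connection, simulating a stuck downstream dependency
        pool.acquire(5);

        // A well-behaved caller bounds its wait and fails fast
        boolean got = pool.tryAcquire(100, TimeUnit.MILLISECONDS);
        System.out.println(got
                ? "unexpected: got a connection"
                : "failed fast after 100ms, as it should");
    }
}
```

If the caller blocks indefinitely here, you have found the same hang your users would hit in production, at a time of your choosing.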

Blast Radius Control

This is critical. Don’t start by killing production databases. Start small.

// Inject latency into a single service endpoint
import jakarta.servlet.*; // javax.servlet on older Spring versions
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ChaosLatencyFilter implements Filter {
    @Value("${chaos.latency.enabled:false}")
    private boolean enabled;

    @Value("${chaos.latency.ms:0}")
    private int latencyMs;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (enabled && shouldApply()) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status, skip the delay
            }
        }
        chain.doFilter(req, res);
    }

    private boolean shouldApply() {
        return ThreadLocalRandom.current().nextInt(100) < 10; // apply to 10% of requests
    }
}

Start with a single endpoint, 10% of requests, in staging. Expand to production only after you’ve validated the approach. Target one service at a time. Always have a kill switch.
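Because the filter reads its toggles from standard Spring properties (the `@Value` keys above), the kill switch is a one-line config change. The values below are illustrative:

```properties
# application.properties: chaos is off unless explicitly enabled
chaos.latency.enabled=false
chaos.latency.ms=500
```

Flipping `chaos.latency.enabled` back to `false` (or simply omitting it, since the default is `false`) ends the experiment immediately.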

What Chaos Reveals

The failures you find are almost never the ones you expect. Common surprises:

Retry strategies without jitter cause thundering herd effects when the failed service recovers. Timeouts set to 30 seconds when the user gives up after 3. Circuit breakers configured to open after 50 failures when 5 would be appropriate. Connection pools that don’t shrink when a dependency is down, exhausting connections to healthy services.

At Oracle, we discovered a nasty interaction through accidental chaos. A deployment restarted an NSSF service node during peak hours. The other nodes detected the restart and began retrying failed requests, all simultaneously. The returning node got hammered with retry storms before it was fully initialized. Adding artificial latency to the config service path during testing (our version of fault injection) revealed that the retry configuration had no jitter and a very aggressive initial interval. Fixing the retry config with exponential backoff plus jitter reduced retry messages by 50%. We would never have caught this with unit tests alone.
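The shape of that fix is worth spelling out. A sketch of exponential backoff with full jitter, where each retry waits a random amount up to an exponentially growing cap (the base delay and cap here are illustrative, not the actual production values):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryBackoff {
    static final long BASE_MS = 100;    // illustrative initial interval
    static final long CAP_MS = 10_000;  // illustrative ceiling

    // Delay before the given attempt (0-based):
    // random in [0, min(cap, base * 2^attempt))
    static long backoffWithJitter(int attempt) {
        long exp = Math.min(CAP_MS, BASE_MS * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.printf("attempt %d: sleep up to %d ms, chose %d ms%n",
                    attempt, Math.min(CAP_MS, BASE_MS * (1L << attempt)),
                    backoffWithJitter(attempt));
        }
    }
}
```

The jitter is what prevents the thundering herd: clients that failed at the same instant no longer retry at the same instant, so the recovering node sees a spread-out trickle instead of a synchronized wall of requests.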

What I’m Learning

Chaos engineering isn’t about breaking things for fun. It’s about building confidence that your resilience patterns actually work. The key insight for me: tests verify that code handles expected failures. Chaos engineering verifies that the system handles unexpected failure combinations. The scariest failures are always the ones you didn’t think to test.

Have you tried chaos engineering, even informally? What was the most surprising failure you discovered?