Service A calls Service B. Service B is down. Service A keeps retrying. Now both services are down.

Circuit breakers prevent this cascade.

The Problem#

You have a microservice calling an external payment API. The API goes down. Your service waits for the timeout (say, 30 seconds) on each request. Threads pile up waiting. The request queue grows. Memory fills up. Your service crashes.

The retry storm makes it worse. When the external API tries to recover, it gets hammered by backed-up retries. It goes down again.

You need to fail fast instead of waiting.

How Circuit Breakers Work#

Three states: Closed, Open, Half-Open.

Closed (Normal Operation):

  • Requests pass through to downstream service
  • Track success/failure rate
  • If failures exceed threshold, transition to Open

Open (Failing Fast):

  • Stop sending requests to failing service
  • Return error immediately (no waiting for timeout)
  • Save threads, memory, prevent cascade
  • After wait duration (e.g., 60 seconds), transition to Half-Open

Half-Open (Testing Recovery):

  • Allow limited requests through (e.g., 3 requests)
  • All succeed? Back to Closed (service recovered)
  • Any fail? Back to Open (not ready yet)
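The three states above can be sketched as a small state machine. This is an illustrative simplification, not Resilience4j's actual implementation: it trips on consecutive failures rather than a sliding-window failure rate, and the names (`SimpleCircuitBreaker`, `call`) are invented for this example.

```java
import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // consecutive failures before opening
    private final long waitDurationMillis;  // how long to stay open
    private final int halfOpenPermits;      // successful test calls needed to close

    private State state = State.CLOSED;
    private int failureCount = 0;
    private int halfOpenSuccesses = 0;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long waitDurationMillis, int halfOpenPermits) {
        this.failureThreshold = failureThreshold;
        this.waitDurationMillis = waitDurationMillis;
        this.halfOpenPermits = halfOpenPermits;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= waitDurationMillis) {
                state = State.HALF_OPEN;    // wait elapsed: probe the service
                halfOpenSuccesses = 0;
            } else {
                return fallback.get();      // fail fast, no timeout wait
            }
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++halfOpenSuccesses >= halfOpenPermits) {
                state = State.CLOSED;       // service recovered
                failureCount = 0;
            }
        } else {
            failureCount = 0;
        }
    }

    private void onFailure() {
        // Any failure in half-open, or too many in closed, opens the circuit
        if (state == State.HALF_OPEN || ++failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }

    public synchronized State state() { return state; }
}
```

The `fallback` supplier is what the caller gets while the circuit is open: an immediate answer instead of a blocked thread.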
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Failure rate > threshold
    Open --> HalfOpen: Wait duration elapsed
    HalfOpen --> Closed: Test requests succeed
    HalfOpen --> Open: Test requests fail
    Closed --> Closed: Normal operation
    Open --> Open: Reject requests immediately

Circuit breaker state transitions. Open state prevents cascade failures.

Java Implementation with Resilience4j#

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final ExternalPaymentAPI externalPaymentAPI;  // injected payment client

    public PaymentService(ExternalPaymentAPI externalPaymentAPI) {
        this.externalPaymentAPI = externalPaymentAPI;
    }

    @CircuitBreaker(name = "paymentAPI", fallbackMethod = "paymentFallback")
    public PaymentResponse processPayment(PaymentRequest request) {
        // Call the external payment API; failures count toward the breaker
        return externalPaymentAPI.charge(request);
    }

    // Fallback invoked when the circuit is open or the call fails
    public PaymentResponse paymentFallback(PaymentRequest request, Exception ex) {
        // Return a cached response, default value, or explicit error
        return PaymentResponse.serviceUnavailable();
    }
}

Configuration (application.yml):

resilience4j:
  circuitbreaker:
    instances:
      paymentAPI:
        slidingWindowSize: 10              # Track last 10 requests
        failureRateThreshold: 50           # Open if 50% fail
        waitDurationInOpenState: 60s       # Stay open for 60 seconds
        permittedNumberOfCallsInHalfOpenState: 3  # Allow 3 test requests
        slowCallDurationThreshold: 5s      # Call is slow if > 5 seconds
        slowCallRateThreshold: 50          # Open if 50% slow
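If you are not using Spring Boot's auto-configuration, the same settings can be expressed programmatically with Resilience4j's builder API. A sketch (the instance name `paymentAPI` matches the YAML above):

```java
import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .slidingWindowSize(10)                              // track last 10 requests
        .failureRateThreshold(50)                           // open if 50% fail
        .waitDurationInOpenState(Duration.ofSeconds(60))    // stay open for 60 seconds
        .permittedNumberOfCallsInHalfOpenState(3)           // allow 3 test requests
        .slowCallDurationThreshold(Duration.ofSeconds(5))   // call is slow if > 5 seconds
        .slowCallRateThreshold(50)                          // open if 50% slow
        .build();

CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("paymentAPI");
```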

Real-World Thresholds#

Failure rate threshold: 50% over 10 requests is common. Too sensitive (20%)? Opens on minor blips. Too lenient (80%)? Damage already done.

Wait duration: 60 seconds typical. Too short? Keep testing service that’s still down. Too long? Users wait unnecessarily when service recovers.

Half-open calls: 3-5 requests. Enough to verify recovery without overwhelming recovering service.
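The "50% over 10 requests" style of threshold implies tracking outcomes over a sliding window. A minimal sketch of that bookkeeping (the class name `SlidingWindowStats` is invented for this example; Resilience4j does this internally):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowStats {
    private final int windowSize;
    private final Deque<Boolean> outcomes = new ArrayDeque<>();  // true = failure

    public SlidingWindowStats(int windowSize) {
        this.windowSize = windowSize;
    }

    public void record(boolean failed) {
        outcomes.addLast(failed);
        if (outcomes.size() > windowSize) {
            outcomes.removeFirst();  // keep only the last windowSize outcomes
        }
    }

    // Failure rate in percent over the window; 0 until the window is full,
    // so a single early failure can't trip the breaker
    public double failureRatePercent() {
        if (outcomes.size() < windowSize) return 0.0;
        long failures = outcomes.stream().filter(f -> f).count();
        return 100.0 * failures / outcomes.size();
    }
}
```

Waiting for a full window before reporting a rate is one way to avoid opening the circuit on the very first failure after startup.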

What I’ve Seen#

Microservices architecture calling external APIs (payment gateways, notification services). Without circuit breakers, one slow external API kills the entire service. Threads blocked waiting for timeouts. Memory exhausted. Kubernetes keeps restarting pods.

Added Resilience4j circuit breakers. Service stayed up even when external dependencies went down. Users got immediate errors instead of hanging requests. Operations team could see which dependency was failing (circuit open = problem there, not here).

The key insight: failing fast is better than hanging. Circuit breakers let your service survive even when dependencies don’t.

Have you used circuit breakers in production? What thresholds work for your services?