Circuit Breakers: Failing Fast to Stay Alive
Service A calls Service B. Service B is down. Service A keeps retrying. Now both services are down.
Circuit breakers prevent this cascade.
The Problem#
You have a microservice calling an external payment API. The API goes down. Your service waits for the timeout (say 30 seconds) on each request. Threads pile up waiting. The request queue grows. Memory fills up. Your service crashes. The arithmetic is brutal: at 50 requests per second and a 30-second timeout, roughly 1,500 threads are blocked at any moment (Little's law).
The retry storm makes it worse. When the external API tries to recover, it gets hammered by backed-up retries and goes down again.
You need to fail fast instead of waiting.
How Circuit Breakers Work#
Three states: Closed, Open, Half-Open.
Closed (Normal Operation):
- Requests pass through to downstream service
- Track success/failure rate
- If failures exceed threshold, transition to Open
Open (Failing Fast):
- Stop sending requests to failing service
- Return error immediately (no waiting for timeout)
- Save threads, memory, prevent cascade
- After wait duration (e.g., 60 seconds), transition to Half-Open
Half-Open (Testing Recovery):
- Allow limited requests through (e.g., 3 requests)
- All succeed? Back to Closed (service recovered)
- Any fail? Back to Open (not ready yet)
(Diagram: circuit breaker state transitions. The Open state is what prevents cascade failures.)
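The three states above can be sketched as a small state machine. This is a minimal illustration, not Resilience4j's implementation: the class and method names are made up for this post, and it uses a simple consecutive-failure count rather than the sliding-window failure rate a real library tracks.

```java
import java.util.concurrent.Callable;

// Minimal sketch of the Closed / Open / Half-Open state machine.
// Names are illustrative, not from any library.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;            // consecutive failures while Closed
    private int halfOpenSuccesses = 0;   // successful probes while Half-Open
    private long openedAt = 0;

    private final int failureThreshold;     // failures before opening
    private final long waitDurationMillis;  // how long to stay Open
    private final int halfOpenProbes;       // successes needed to close again

    SimpleCircuitBreaker(int failureThreshold, long waitDurationMillis, int halfOpenProbes) {
        this.failureThreshold = failureThreshold;
        this.waitDurationMillis = waitDurationMillis;
        this.halfOpenProbes = halfOpenProbes;
    }

    synchronized <T> T call(Callable<T> action, T fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= waitDurationMillis) {
                state = State.HALF_OPEN;   // wait elapsed: test recovery
                halfOpenSuccesses = 0;
            } else {
                return fallback;           // fail fast, no downstream call
            }
        }
        try {
            T result = action.call();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            return fallback;
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++halfOpenSuccesses >= halfOpenProbes) {
                state = State.CLOSED;      // service recovered
                failures = 0;
            }
        } else {
            failures = 0;                  // any success resets the count
        }
    }

    private void onFailure() {
        // Any Half-Open failure, or too many Closed failures, opens the circuit.
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }

    State state() { return state; }
}
```

Note that `call` is synchronized for simplicity; production implementations also cap how many concurrent probes the Half-Open state lets through.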
Java Implementation with Resilience4j#
```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    private final ExternalPaymentAPI externalPaymentAPI; // injected payment client

    public PaymentService(ExternalPaymentAPI externalPaymentAPI) {
        this.externalPaymentAPI = externalPaymentAPI;
    }

    @CircuitBreaker(name = "paymentAPI", fallbackMethod = "paymentFallback")
    public PaymentResponse processPayment(PaymentRequest request) {
        // Call the external payment API
        return externalPaymentAPI.charge(request);
    }

    // Fallback when the circuit is open or the call fails
    public PaymentResponse paymentFallback(PaymentRequest request, Exception ex) {
        // Return a cached response, default value, or error
        return PaymentResponse.serviceUnavailable();
    }
}
```
Configuration (application.yml):
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentAPI:
        slidingWindowSize: 10                      # Track the last 10 requests
        failureRateThreshold: 50                   # Open if 50% fail
        waitDurationInOpenState: 60s               # Stay open for 60 seconds
        permittedNumberOfCallsInHalfOpenState: 3   # Allow 3 test requests
        slowCallDurationThreshold: 5s              # A call is slow if > 5 seconds
        slowCallRateThreshold: 50                  # Open if 50% are slow
```
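If you prefer configuring in code rather than YAML, Resilience4j also exposes a builder API. The sketch below mirrors the configuration above; treat it as a guide and check the Resilience4j documentation for the exact signatures in your version:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowSize(10)
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(60))
    .permittedNumberOfCallsInHalfOpenState(3)
    .slowCallDurationThreshold(Duration.ofSeconds(5))
    .slowCallRateThreshold(50)
    .build();

CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
CircuitBreaker breaker = registry.circuitBreaker("paymentAPI");
```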
Real-World Thresholds#
Failure rate threshold: 50% over 10 requests is common. Too sensitive (20%)? It opens on minor blips. Too lenient (80%)? The damage is already done by the time it trips.
Wait duration: 60 seconds is typical. Too short? You keep hammering a service that's still down. Too long? Users wait unnecessarily after the service recovers.
Half-open calls: 3-5 requests. Enough to verify recovery without overwhelming a recovering service.
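To make the "50% over 10 requests" threshold concrete, here is a sketch of a count-based sliding window, similar in spirit to Resilience4j's `slidingWindowSize`/`failureRateThreshold` (the class and method names here are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window over the last N call outcomes.
class FailureRateWindow {
    private final int size;                 // e.g. 10 requests
    private final double thresholdPercent;  // e.g. 50.0
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    FailureRateWindow(int size, double thresholdPercent) {
        this.size = size;
        this.thresholdPercent = thresholdPercent;
    }

    // Record one call outcome; evict the oldest once the window is full.
    void record(boolean failed) {
        if (outcomes.size() == size) outcomes.removeFirst();
        outcomes.addLast(failed);
    }

    double failureRatePercent() {
        if (outcomes.isEmpty()) return 0.0;
        long failures = outcomes.stream().filter(f -> f).count();
        return 100.0 * failures / outcomes.size();
    }

    // Only trip once the window is full, so one early failure
    // (1 of 1 = 100%) does not open the circuit.
    boolean shouldOpen() {
        return outcomes.size() == size
            && failureRatePercent() >= thresholdPercent;
    }
}
```

The "only when full" guard is why the window size matters: a smaller window reacts faster but is noisier, a larger one is steadier but slower to trip.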
What I’ve Seen#
A microservices architecture calling external APIs (payment gateways, notification services). Without circuit breakers, one slow external API kills the entire service. Threads blocked waiting for timeouts. Memory exhausted. Kubernetes keeps restarting pods.
Added Resilience4j circuit breakers. Service stayed up even when external dependencies went down. Users got immediate errors instead of hanging requests. Operations team could see which dependency was failing (circuit open = problem there, not here).
The key insight: failing fast is better than hanging. Circuit breakers let your service survive even when dependencies don’t.
Have you used circuit breakers in production? What thresholds work for your services?