SLOs and Error Budgets: When Good Enough is a Number
“We need 100% uptime.” No, you don’t. And you can’t have it anyway.
What you need is a number. A specific, measurable target that tells you when reliability is good enough and when it’s not. That’s an SLO.
SLI, SLO, SLA#
SLI (Service Level Indicator): What you measure. Request latency, error rate, availability. Concrete metrics from your structured logs or monitoring system.
SLO (Service Level Objective): What you promise internally. “99.9% of requests complete under 200ms over a 30-day window.” The target for your SLI.
SLA (Service Level Agreement): What’s in the contract with customers. Usually looser than SLO. If you violate the SLA, there are financial consequences (credits, refunds).
Your SLO should be tighter than your SLA. If the SLA is 99.9%, set the SLO at 99.95%. That gives you a buffer.
Error Budget Math#
99.9% availability over 30 days means you can be down for 43.2 minutes. That’s your error budget.
30 days x 24 hours x 60 minutes = 43,200 minutes
0.1% error budget = 43.2 minutes of downtime
| SLO | Monthly Error Budget |
|---|---|
| 99% | 7.2 hours |
| 99.9% | 43 minutes |
| 99.95% | 21 minutes |
| 99.99% | 4.3 minutes |
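The table is the same multiplication at different targets. A quick sketch to reproduce it (class and method names are mine, not from any library):

```java
public class BudgetMath {
    // Monthly downtime budget in minutes for a given availability target.
    public static double monthlyBudgetMinutes(double slo) {
        double windowMinutes = 30 * 24 * 60; // 43,200 minutes in 30 days
        return (1.0 - slo) * windowMinutes;
    }

    public static void main(String[] args) {
        for (double slo : new double[] {0.99, 0.999, 0.9995, 0.9999}) {
            System.out.printf("%.2f%% -> %.1f minutes%n",
                    slo * 100, monthlyBudgetMinutes(slo));
        }
    }
}
```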
Every incident, every slow deploy, every P99 spike eats into that budget.
What Happens When the Budget Burns#
This is where SLOs become powerful. Error budget is a decision-making tool.
Budget remaining: Ship features. Take risks. Deploy faster. The system can absorb some failures.
Budget burned: Freeze feature deploys. Focus entirely on reliability. Fix the circuit breakers, add timeouts, reduce tail latency.
public class ErrorBudgetTracker {
    private final double sloTarget; // e.g. 0.999 for a 99.9% SLO

    public ErrorBudgetTracker(double sloTarget) {
        this.sloTarget = sloTarget;
    }

    // Fraction of the error budget still unspent, clamped to [0, 1].
    // Budget used = observed failure rate / allowed failure rate, so hitting
    // the SLO boundary exactly means the budget is fully spent.
    public double remainingBudget(long totalRequests, long failedRequests) {
        if (totalRequests == 0) return 1.0; // no traffic, nothing spent
        double failureRate = (double) failedRequests / totalRequests;
        double budgetUsed = failureRate / (1.0 - sloTarget);
        return Math.min(1.0, Math.max(0.0, 1.0 - budgetUsed));
    }
}
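A standalone sanity check on the arithmetic (traffic numbers invented): against a 99.9% target, 1,000,000 requests allow 1,000 failures, so 500 failures should leave half the budget.

```java
public class BudgetCheck {
    // Budget used = observed failure rate / allowed failure rate.
    public static double remainingFraction(double sloTarget, long total, long failed) {
        double budgetUsed = ((double) failed / total) / (1.0 - sloTarget);
        return Math.max(0.0, 1.0 - budgetUsed);
    }

    public static void main(String[] args) {
        double remaining = remainingFraction(0.999, 1_000_000, 500);
        System.out.printf("%.0f%% of budget remaining%n", remaining * 100);
        if (remaining <= 0) {
            System.out.println("Freeze feature deploys until the window recovers");
        }
    }
}
```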
Picking the Right SLI#
Not every metric is a good SLI. Pick metrics that reflect user experience.
Good SLIs: request latency (P99), error rate (5xx responses), availability (successful requests / total requests).
Bad SLIs: CPU usage (users don’t care about your CPU), deployment count (activity isn’t quality).
At Oracle, we initially used average latency as our SLI. Dashboard looked great. But averages hide problems. A small percentage of users were hitting 3+ second response times. We switched to P99 latency as the primary SLI. Suddenly the dashboard told the truth, and we had to actually fix the slow paths instead of declaring victory on averages.
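The averages-vs-tail gap above is easy to reproduce numerically. A small sketch (sample data invented) using the nearest-rank method for P99:

```java
import java.util.Arrays;

public class LatencySli {
    // Nearest-rank P99: the latency that 99% of requests come in under.
    public static long p99(long[] latenciesMs) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.99 * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        // 98 fast requests plus two 3-second stragglers.
        long[] samples = new long[100];
        Arrays.fill(samples, 120);
        samples[0] = 3_000;
        samples[1] = 3_000;
        // The average is ~178 ms and looks healthy; the P99 does not.
        System.out.println("P99 = " + p99(samples) + " ms");
    }
}
```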
What I’m Learning#
SLOs changed how I think about reliability. Before, it was “minimize all errors.” Now it’s “how much error can we tolerate while still serving users well?” The shift from “zero errors” to “error budget” is liberating. It means you can make informed trade-offs between shipping speed and reliability.
The hardest part isn’t the math. It’s getting your team to agree on the number. Too tight and you can’t ship anything. Too loose and users suffer.
What SLO does your team target? Do you actually freeze deploys when the budget burns?