SLOs & error budgets

"Be reliable" is not a goal; it is a mood. An SLO turns it into a number — this service will succeed for 99.9% of requests over 30 days — and the gap between that number and perfection becomes a budget you are allowed to spend on releases, experiments, and the occasional bad deploy. That one move changes everything downstream: what you alert on, when you page a human, and how the team that ships features negotiates with the team that gets woken up.

Why 100% is the wrong target

Start from the uncomfortable truth: past a certain point, users cannot tell the difference. A user on a phone with flaky Wi-Fi, behind a mobile carrier, talking to your service through two CDNs already experiences more failure from everything else in the path than your last nine would remove. Meanwhile every extra nine costs dramatically more than the one before it — redundant everything, slower releases, more conservative engineering — and the only thing a 100% target guarantees is that you will miss it.

So the question stops being "how reliable can we be" and becomes "how reliable do we need to be, for these users, for this feature, at a cost we accept." That is a product decision dressed up as an engineering one, and the SLO is where you write the answer down. Everything else on this page — error budgets, burn rates, the budget policy — is machinery for holding yourself to a number once you have had the honesty to pick one.

SLI, SLO, SLA: three letters apart, very different jobs

The terms get blurred constantly, and the blur causes real damage, so be precise. An SLI (service level indicator) is a measurement — the thing you can actually compute from telemetry. An SLO (service level objective) is a target for that measurement over a window — the promise you make to yourselves. An SLA (service level agreement) is a contract with consequences — the promise you make to customers, with refunds or penalties attached when you break it.

Term	What it is	Example	Audience
SLI	A measurement	Fraction of requests served successfully in under 400 ms	Engineers
SLO	A target for the SLI	99.9% of requests, over a rolling 30 days	The team & its stakeholders
SLA	A contract on (usually) a looser target	99.5% monthly, or customers get service credits	Customers, lawyers

The ordering matters: the SLA should always be looser than the SLO, so that the internal alarm goes off well before money is owed. If your SLO and SLA are the same number, you have no margin — the first time you learn you are in trouble is the moment you owe refunds. And teams without external customers still want SLOs; the SLA is the optional part, not the objective.

Choosing SLIs: measure what users feel, as a ratio

A good SLI has a specific shape: the proportion of good events over valid events. Availability is "successful requests / total requests." Latency is "requests faster than the threshold / total requests." Freshness for a pipeline is "records processed within X minutes / total records." Expressing everything as a ratio between 0 and 100% is what lets one error-budget mechanism work for all of them.
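As a rough illustration of the ratio shape, here is a minimal Python sketch. The `Request` record, the function names, and the rule that any non-5xx status counts as "successful" are assumptions made for the example; your own definition of good and valid events may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one request as seen at the edge."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def sli(good: int, valid: int) -> float:
    """Proportion of good events over valid events, in [0, 1]."""
    if valid == 0:
        # Assumption: with no valid events, nothing was broken.
        return 1.0
    return good / valid


def availability_sli(requests: list[Request]) -> float:
    # Assumption: any response below 500 counts as successful.
    good = sum(1 for r in requests if r.status < 500)
    return sli(good, len(requests))


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    # Good events are requests faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli(good, len(requests))


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 30.0),
    Request(404, 80.0),
]
print(availability_sli(sample))    # 3 of 4 requests succeeded
print(latency_sli(sample, 400.0))  # 3 of 4 requests beat 400 ms
```

Both functions reduce to the same `sli(good, valid)` call, which is the property that lets one budget mechanism serve every indicator.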

Two rules of thumb do most of the work. First, measure as close to the user as you can. A server-side success metric misses requests that never arrived: the ones the load balancer dropped, the DNS lookups that failed, the calls the client gave up on before a response came back. Load-balancer logs beat server metrics; client-side telemetry beats both, at the cost of noise you do not control. Second, keep a handful of SLIs, not a dashboard of them. For a request-driven service, availability and latency cover most of what users feel. Each additional SLI dilutes attention and multiplies alerts; three to five per user-facing journey is plenty.

Note the framing "per journey," not "per microservice." Users do not experience your services; they experience checkout, search, sign-in. An SLO on each user journey, measured at the edge, tells you something true about the product. Forty per-service SLOs mostly tell you which teams were optimistic at planning time.

The tail-latency trap: never average your way to "fine"

The classic mistake is to set a latency SLO on the average: "mean latency under 200 ms." Latency distributions are violently skewed — a fat cluster of fast requests and a long tail of slow ones — and the mean sits comfortably inside the fast cluster while one user in fifty waits three seconds. An averaged SLO can be green for months while your heaviest users, the ones whose requests do the most work, quietly suffer. Worse, an average can be dragged down by a flood of fast trivial requests, masking a genuine regression on the requests that matter.

The mean lives in the fast cluster. A threshold-ratio SLI makes the tail count against the budget.

The fix is the ratio form above: pick a threshold users actually notice and count the fraction of requests that beat it — "99% of requests complete within 400 ms." That makes a slow request a budget-burning event like any error, folds latency into the same machinery as availability, and keeps the tail visible. If you need two thresholds (one for "fast," one for "tolerable"), set two SLOs; do not average.
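To make the difference concrete, the sketch below runs both kinds of SLO against the same data. The latencies are invented for the example (98 fast requests and 2 slow ones, matching the "one user in fifty" case), and the 200 ms and 400 ms thresholds are the ones used in the text.

```python
from statistics import mean

# Hypothetical sample: 98 fast requests, 2 that take three seconds.
latencies_ms = [50.0] * 98 + [3000.0] * 2


def fraction_within(latencies: list[float], threshold_ms: float) -> float:
    """Fraction of requests that complete within the threshold."""
    return sum(1 for x in latencies if x <= threshold_ms) / len(latencies)


mean_ms = mean(latencies_ms)
within_400 = fraction_within(latencies_ms, 400.0)

# Averaged SLO: "mean latency under 200 ms".
mean_slo_met = mean_ms < 200.0
# Threshold-ratio SLO: "99% of requests complete within 400 ms".
ratio_slo_met = within_400 >= 0.99

print(f"mean = {mean_ms:.0f} ms, averaged SLO met: {mean_slo_met}")
print(f"within 400 ms = {within_400:.0%}, ratio SLO met: {ratio_slo_met}")
```

On this sample the averaged SLO passes while the ratio SLO fails, because only the ratio form counts the two slow requests as budget-burning events.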

The error budget: unreliability you are allowed to spend

Once the SLO exists, the budget falls out by arithmetic: it is simply 100% minus the target. A 99.9% SLO over 30 days allows 0.1% of events to fail — if traffic were perfectly steady, about 43 minutes of full outage, or twice as long at half-impact, or a 0.1% error rate trickling all month. The budget does not care how you spend it; failed requests are failed requests whether they came from one dramatic outage or a slow leak.

The reframe is the point. Without a budget, every error is bad and the implicit target is perfection, so risk-taking is a moral failing. With a budget, unreliability is a resource: plenty of budget left means ship faster, run the risky migration, skip the second staging soak. Budget nearly gone means slow down, harden, pay the reliability debt. The budget gives the people who ship and the people who get paged a shared number to argue over instead of arguing over feelings — and it gives "no" a price tag, which is the only thing that has ever made prioritisation conversations honest.

Quick arithmetic worth memorising. Over 30 days: 99.9% leaves 43 m 12 s of budget; 99.95% leaves 21 m 36 s; 99.99% leaves 4 m 19 s. The jump from three nines to four is the one that forces architectural change — at four nines a human cannot be in the recovery loop, because paging, waking, and orienting someone already spends the month's budget.
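The figures above can be reproduced with a few lines of Python. This is a sketch under the same simplifying assumption as the text (perfectly steady traffic, so budget can be read as time); the function name `error_budget` is illustrative.

```python
from datetime import timedelta


def error_budget(slo_target: float,
                 window: timedelta = timedelta(days=30)) -> timedelta:
    """Full-outage time the SLO allows over the window.

    Assumes steady traffic, so a fraction of events maps to a
    fraction of time.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # The budget is 100% minus the target, applied to the window.
    return window * (1.0 - slo_target)


for target in (0.999, 0.9995, 0.9999):
    seconds = error_budget(target).total_seconds()
    minutes, rest = divmod(round(seconds), 60)
    print(f"{target:.2%}: about {minutes} m {rest} s")
```

A target of exactly 1.0 returns a zero budget, which is the arithmetic behind the argument that a 100% target leaves nothing to spend.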

Burn rate: alerting on the budget, not the threshold

Naive SLO alerting, such as "page if the error rate exceeds 0.1% for five minutes," fails in both directions. It pages at 3 a.m. for a blip that, over the month, would consume a rounding error of budget. And it can stay silent through a slow leak that quietly eats the whole budget by day twenty. The unit is wrong: thresholds measure the moment, but the SLO is a promise about the window.

Burn rate fixes the unit. It is how fast you are consuming budget relative to the rate that would spend exactly all of it by the end of the window. Burn rate 1 means you will land precisely on the SLO — for a 99.9% target, a steady 0.1% error rate. Burn rate 10 means the month's budget is gone in three days. For a 99.9% SLO, a 14.4× burn rate corresponds to a 1.44% error rate, which exhausts 2% of the 30-day budget in a single hour. Alerting on burn rate means alert urgency finally tracks the thing you actually promised.
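The relationships in this paragraph are simple enough to check in code. The following is a minimal sketch; the function names are illustrative, and the 720-hour default is the 30-day window used throughout.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the rate that spends exactly
    the whole budget by the end of the window."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("a 100% target leaves no budget to burn")
    return error_rate / budget


def budget_consumed(rate: float, elapsed_hours: float,
                    window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent after burning at
    `rate` for `elapsed_hours`."""
    return rate * elapsed_hours / window_hours


def hours_to_exhaustion(rate: float,
                        window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the budget is gone at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")
    return window_hours / rate


rate = burn_rate(error_rate=0.0144, slo_target=0.999)
print(round(rate, 1))                            # 1.44% errors on a 99.9% SLO
print(round(budget_consumed(rate, 1.0), 3))      # share of budget spent in 1 h
print(hours_to_exhaustion(10.0) / 24)            # days to exhaustion at 10x
```

Note that `burn_rate` refuses a 100% target rather than dividing by a zero budget.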

Multi-window, multi-burn-rate: the pattern that ends alert fatigue

The mature version, worked out in the Google SRE workbook's "Alerting on SLOs" chapter, layers several burn rates over several windows, each with its own response. Fast burns page a human immediately; slow burns file a ticket for working hours. The standard configuration for a 99.9%, 30-day SLO:

Burn rate	Long window	Short window	Budget consumed	Response
14.4×	1 hour	5 min	2% in 1 h	Page
6×	6 hours	30 min	5% in 6 h	Page
1×	3 days	6 hours	10% in 3 d	Ticket

Each tier fires only when the burn rate measured over both of its windows exceeds the tier's threshold. The long window provides confidence that the burn is real and sustained; the short window, typically one-twelfth of the long one, makes the alert stop quickly once the problem is fixed, instead of grumbling for the rest of the hour about errors that already ended. Without the short window, your highest-confidence alerts are also your slowest to reset, which trains people to ignore them.
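The evaluation logic for the table can be sketched in Python as follows. This is an illustration of the both-windows rule, not a production alerting rule: the `Tier` type, the `evaluate` function, and the dictionary of pre-computed error rates keyed by window name are all assumptions made for the example. In practice those rates would come from your metrics system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    burn_rate: float
    long_window: str
    short_window: str
    response: str


# Tiers from the table, most urgent first.
TIERS = (
    Tier(14.4, "1h", "5m", "page"),
    Tier(6.0, "6h", "30m", "page"),
    Tier(1.0, "3d", "6h", "ticket"),
)


def evaluate(error_rates: dict[str, float],
             slo_target: float = 0.999) -> Optional[str]:
    """Return the response of the first tier whose long AND short
    windows both exceed its burn-rate threshold, else None.

    `error_rates` maps a window name (e.g. "1h") to the error rate
    observed over that window (hypothetical input format).
    """
    budget = 1.0 - slo_target
    for tier in TIERS:
        threshold = tier.burn_rate * budget
        long_hot = error_rates[tier.long_window] > threshold
        short_hot = error_rates[tier.short_window] > threshold
        if long_hot and short_hot:
            return tier.response
    return None


# Ongoing outage: both the 1 h and 5 min windows are burning fast.
outage = {"5m": 0.05, "30m": 0.05, "1h": 0.02, "6h": 0.004, "3d": 0.001}
# Same outage after the fix: the 1 h window is still hot, 5 min is clean.
recovered = {"5m": 0.0, "30m": 0.0, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
# Slow leak: a steady 0.2% error rate in every window.
leak = {"5m": 0.002, "30m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.002}

print(evaluate(outage), evaluate(recovered), evaluate(leak))
```

The `recovered` case shows what the short window buys: the long window alone would keep paging for the rest of the hour, but the clean five-minute window silences the alert as soon as the errors stop.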

Three failure shapes, three responses. The slow leak never trips a threshold alert — burn-rate tiers catch it anyway.

The deeper shift is philosophical: this is alerting on symptoms, not causes. You no longer page on "CPU is high" or "disk is filling" — conditions that may or may not hurt anyone — you page on "we are measurably breaking the promise we made," and you investigate causes once you are awake. Cause-based alerts become dashboards and tickets. Pages become rare, and every one of them means a user is feeling it right now.

The budget policy: what actually happens at zero

An error budget with no agreed consequence is a chart, not a mechanism. The budget policy is the document, written in peacetime and agreed by engineering and product leadership, that says what happens as the budget drains. A typical ladder has three rungs. At 50% consumed, reliability work moves up the backlog. At 100% consumed, feature releases pause, except for changes that improve reliability; the postmortem backlog gets staffed; and risky migrations wait for the window to roll. When the budget is exhausted repeatedly across quarters, the SLO itself is renegotiated, because either the target is wrong or the investment is.
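One way to keep such a ladder unambiguous is to write it down as code next to the prose policy. The sketch below is purely illustrative: the `Action` names, the `policy_action` function, and the choice of two exhausted quarters as the meaning of "repeatedly" are assumptions, and a real policy would set its own rungs.

```python
from enum import Enum


class Action(Enum):
    SHIP_NORMALLY = "ship normally"
    PRIORITISE_RELIABILITY = "move reliability work up the backlog"
    FREEZE_FEATURES = "pause feature releases; reliability changes only"
    RENEGOTIATE_SLO = "renegotiate the SLO"


def policy_action(consumed: float,
                  exhausted_quarters: int = 0,
                  repeat_limit: int = 2) -> Action:
    """Map budget state to the agreed response.

    consumed: fraction of the current window's budget already spent.
    exhausted_quarters: recent quarters in which the budget ran out.
    repeat_limit: assumed threshold for "repeatedly exhausted".
    """
    if exhausted_quarters >= repeat_limit:
        return Action.RENEGOTIATE_SLO
    if consumed >= 1.0:
        return Action.FREEZE_FEATURES
    if consumed >= 0.5:
        return Action.PRIORITISE_RELIABILITY
    return Action.SHIP_NORMALLY


print(policy_action(0.2).value)
print(policy_action(0.6).value)
print(policy_action(1.1).value)
print(policy_action(1.1, exhausted_quarters=2).value)
```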

Two things make a policy real. It is agreed before the bad month, so invoking it is executing a plan rather than starting a fight. And it binds in both directions: if the budget is comfortably unspent quarter after quarter, that is also a signal. Either you are over-investing in reliability, or the SLO is set looser than what users have come to depend on. (Users calibrate to delivered reliability, not promised reliability. Run too far above your SLO for too long and your actual performance becomes the de facto promise. Google has gone as far as deliberately injecting downtime into a too-reliable service, its Chubby lock service, to shake out the dependents who had quietly assumed it was perfect.)

Pitfalls worth avoiding

Averaged latency SLIs: covered above; use threshold ratios.
SLO = SLA: leaves no margin between "we noticed" and "we owe money."
Too many SLOs: forty objectives means none of them gates anything; pick the few journeys that matter.
Measuring where it is convenient: server-side metrics miss the failures users see most; measure at the load balancer or client.
100% targets: unpayable, and they make the budget zero, which makes every mechanism on this page divide by it.
A budget nobody enforces: without a pre-agreed policy, the budget is decoration and the burn-rate alerts are just more noise.
Excluding "planned" failures from the SLI: users experience your maintenance windows as downtime; if the SLI quietly ignores them, the number stops describing reality.
