Multi-page · for engineers on call
Observability

Knowing what a system is actually doing.

Monitoring tells you about the things you already knew to watch. Observability is being able to ask a new question of a running system at 3am, without shipping new code first. Most of the field comes down to three signals used well: logs, metrics, and traces. Distributed tracing stitches them together so one request can be followed across a dozen services, and SLOs give you a written-down goal to point them at. Get those right and most incidents turn from a guessing game into a query.

Two sub-pages are live, with two more in flight. Practical mental models for the people who get paged, not a vendor tour.


Live deep dives

Start here.

Planned deep dives

Two more, in flight.

The reliability and kernel-level halves of the topic. In the order they make sense to learn:

  03 · SLOs & error budgets
    Turn "be reliable" into a number you can act on. SLIs, SLOs, error budgets, burn-rate alerts, and how a budget changes the conversation between the people who ship and the people who get paged.
    SLI/SLO · error budget · burn rate · alerting · the budget policy
  04 · eBPF observability
    See what a running kernel is doing without changing a line of application code. How eBPF safely runs sandboxed programs in the kernel, and what that unlocks for zero-instrumentation tracing, profiling, and networking.
    eBPF · kernel probes · zero instrumentation · continuous profiling · Cilium
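
As a taste of what the SLOs & error budgets page is about, here is a minimal sketch of the arithmetic behind "a number you can act on". It assumes a simple request-based SLI (good requests over total requests). The function names and the example figures are illustrative only and are not taken from any particular tool.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests that may fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    A value of 1.0 means the budget is being spent at exactly the pace
    that would use it up at the end of the SLO window; higher is faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad / total) / error_budget(slo_target)


def budget_remaining(bad: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left after `bad` failures in `total` requests."""
    if total == 0:
        return 1.0
    allowed_failures = error_budget(slo_target) * total
    return 1.0 - bad / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO, 10,000 requests in the last hour, 50 failed.
    print(f"burn rate: {burn_rate(50, 10_000, 0.999):.2f}")
    # Hypothetical numbers: 100,000 requests so far this SLO window, 50 failed.
    print(f"budget remaining: {budget_remaining(50, 100_000, 0.999):.0%}")
```

A burn-rate alert is then a comparison of `burn_rate(...)` against a threshold over some window. Which threshold and window to use is a policy choice, and the sketch deliberately leaves it out.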