Knowing what a system is actually doing.
Monitoring tells you the things you already knew to watch. Observability is being able to ask a new question of a running system at 3am, without shipping new code first. The whole field comes down to a few signals used well: logs, metrics, and traces, stitched together with distributed tracing so one request can be followed across a dozen services, and pointed at goals you have written down as SLOs. Get those right and most incidents turn from a guessing game into a query.
Two sub-pages are live, with two more in flight. Practical mental models for the people who get paged, not a vendor tour.
Start here.
Logs, metrics & traces
The three telemetry signals, what each is actually good at, and the costly mistake of reaching for the wrong one. Why metrics answer "is it broken", traces answer "where", and logs answer "why".
OpenTelemetry & distributed tracing
How a single request is followed across a dozen services. Trace and span context, propagation, the OpenTelemetry model, and why tracing is the one signal that survives a microservice rewrite.
Two more, in flight.
The reliability and kernel-level halves of the topic. In the order they make sense to learn:
- 03 SLOs & error budgets
  Turn "be reliable" into a number you can act on. SLIs, SLOs, error budgets, burn-rate alerts, and how a budget changes the conversation between the people who ship and the people who get paged.
  SLI/SLO · error budget · burn rate · alerting · the budget policy
- 04 eBPF observability
  See what a running kernel is doing without changing a line of application code. How eBPF safely runs sandboxed programs in the kernel, and what that unlocks for zero-instrumentation tracing, profiling, and networking.
  eBPF · kernel probes · zero instrumentation · continuous profiling · Cilium