Knowing what a system is actually doing.

Monitoring tells you about the things you already knew to watch. Observability is being able to ask a new question of a running system at 3 a.m., without shipping new code first. The whole field comes down to a few signals used well: logs, metrics, and traces, with traces stitched across service boundaries so one request can be followed through a dozen services, and all of it pointed at goals you have written down as SLOs. Get those right and most incidents turn from a guessing game into a query.

All four sub-pages are live. Practical mental models for the people who get paged, not a vendor tour.

Live deep dives

Start here.

01 Live

Logs, metrics & traces

The three telemetry signals, what each is actually good at, and the costly mistake of reaching for the wrong one. Why metrics answer "is it broken", traces answer "where", and logs answer "why".

the three pillars · cardinality · sampling · cost · when to use which


02 Live

OpenTelemetry & distributed tracing

How a single request is followed across a dozen services. Trace and span context, propagation, the OpenTelemetry model, and why tracing is the one signal that survives a microservice rewrite.

spans · context propagation · OTel · sampling · instrumentation

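To make "context propagation" concrete, here is a minimal sketch of how a trace identifier can travel between services in a W3C Trace Context style `traceparent` header. It is hand-rolled for illustration and is not the OpenTelemetry SDK; the function names (`new_traceparent`, `child_traceparent`, `outgoing_headers`) are hypothetical, and in practice an instrumentation library's propagator would do this for you.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: version, 32-hex trace id, 16-hex span id, flags."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue a trace: keep the trace id and flags, mint a new span id."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers a service would attach to its own downstream calls."""
    incoming = incoming_headers.get("traceparent")
    value = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": value}


# Service A starts a trace; service B receives it and calls service C.
root = new_traceparent()
hop = outgoing_headers({"traceparent": root})["traceparent"]
print(root)
print(hop)
```

Both printed values share the same trace id and differ only in the span id, which is what lets a tracing backend reassemble one request from spans emitted by many services.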

03 Live

SLOs & error budgets

Turn "be reliable" into a number you can act on. SLIs, SLOs, error budgets, burn-rate alerts, and how a budget changes the conversation between the people who ship and the people who get paged.

SLI/SLO · error budget · burn rate · alerting · the budget policy

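As a rough illustration of turning "be reliable" into a number, the sketch below works through the error budget and burn-rate arithmetic. The 99.9% target, the 30-day window, and the request counts are assumed values chosen for the example, not figures from this page.

```python
# Assumed values for illustration only.
SLO_TARGET = 0.999   # 99.9% of requests should succeed
WINDOW_DAYS = 30     # length of the rolling SLO window


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail over the SLO window."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def days_until_exhausted(rate: float, window_days: int) -> float:
    """Days until a full budget is spent if this burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


# 144 failures out of 10,000 requests in the measured interval.
rate = burn_rate(failed=144, total=10_000, slo_target=SLO_TARGET)
print(round(rate, 1))
print(round(days_until_exhausted(rate, WINDOW_DAYS), 2))
```

A burn rate above 1.0 means the budget runs out before the window ends. The higher the rate, the less time there is to react, which is why burn-rate alerts page on the rate itself instead of on a raw error count.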

04 Live

eBPF observability

See what a running kernel is doing without changing a line of application code. How eBPF safely runs sandboxed programs in the kernel, and what that unlocks for zero-instrumentation tracing, profiling, and networking.

eBPF · kernel probes · zero instrumentation · continuous profiling · Cilium


Start here

Logs, metrics & traces

The three signals, what each is good at, and the expensive mistake of reaching for the wrong one. Cardinality, sampling, cost, and a working rule for when to use which — before any of the tooling makes sense.

Where this connects.