Logs, metrics & traces
These are the three kinds of telemetry a running system can emit, and each answers a different question. Metrics tell you that something is wrong. Traces tell you where it is wrong. Logs tell you why. Most observability pain comes from reaching for the wrong one — and most observability bills come from reaching for the most expensive one out of habit. Get the division of labour right and a 3am page becomes a five-minute investigation instead of an hour of grepping.
Monitoring tells you what you expected; observability lets you ask new questions
It helps to start with the distinction the three signals exist to serve. Monitoring is watching the things you already knew to watch: a dashboard of CPU, error rate, and latency, with alerts on thresholds you set in advance. It is necessary and it answers "is the system healthy in the ways I anticipated." Observability is the stronger property of being able to ask a question you did not anticipate — "why are requests from this one region timing out only for logged-in users on Android" — and get an answer from data the system already emitted, without shipping new code first.
The three signals are the raw material for both. Used narrowly they give you monitoring; used well, with enough detail and the ability to slice it, they give you observability. The skill is knowing which signal carries which kind of answer, and not paying for detail you will never query.
Three questions, three signals
Imagine the checkout is slow. A metric — checkout p99 latency — is what trips the alert: a single number over time, cheap to store, perfect for dashboards and thresholds. It tells you something changed at 12:01, but not which of the eight services in the request path is responsible. A trace answers that: it follows one request across every service and shows you the timeline, so the 900 ms spent waiting on the billing call jumps out. Now you know where. A log from the billing service tells you why: a connection-pool timeout because the database was failing over.
Each signal is strong exactly where the others are weak. Trying to find a per-request root cause in metrics, or trying to compute a reliable rate from logs, is the everyday version of using the wrong tool — it sort of works, slowly and expensively. The healthy investigation almost always runs in that order: a metric alerts, a trace localises, logs explain. Internalise that arc and you will stop staring at the wrong screen during an incident.
Metrics: cheap numbers over time
A metric is a numeric measurement recorded at intervals: requests per second, error rate, queue depth, CPU. Because each data point is tiny and aggregatable, a metric can account for every request, with no sampling, and still afford long retention, which makes metrics the right home for alerts, dashboards, and capacity trends. They are aggregate by nature, though, so they tell you about populations, not individuals: great for "the error rate doubled," useless for "what happened to order #8675309."
It pays to know the handful of metric types, because choosing the wrong one produces nonsense. A counter only ever increases (total requests, total errors); you read it by taking its rate of change over time. A gauge goes up and down and you read its current value (queue depth, memory in use, temperature). A histogram buckets observations so you can compute distributions — and this is the one that matters most for latency, because averages lie. A service can have a 50 ms average latency while one user in a hundred waits two seconds; only a histogram lets you see that p99, and tail latency is usually what users actually feel.
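The "averages lie" point is easy to check with arithmetic. The sketch below uses plain Python rather than any metrics library, and the sample is made up: 98 fast requests and 2 slow ones, so that the p99 falls clearly in the tail. The bucket bounds and helper names are assumptions chosen for illustration. It compares the mean with an exact nearest-rank p99, shows the coarser answer a bucketed histogram gives, and reads a counter the way the paragraph describes, as a rate between two samples.

```python
from bisect import bisect_left

# Illustrative sample (made-up numbers): 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [2000] * 2

def mean(values):
    return sum(values) / len(values)

def percentile_nearest_rank(values, p):
    """Smallest sample such that at least p percent of samples are <= it."""
    ordered = sorted(values)
    rank = -(-p * len(ordered) // 100)  # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical bucket upper bounds in ms; the last bucket catches everything.
BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, float("inf")]

def to_histogram(values, bounds):
    """Count observations per bucket, each bucket being (previous bound, bound]."""
    counts = [0] * len(bounds)
    for v in values:
        counts[bisect_left(bounds, v)] += 1
    return counts

def histogram_percentile_bound(counts, bounds, p):
    """Upper bound of the bucket that contains the p-th percentile."""
    rank = -(-p * sum(counts) // 100)
    cumulative = 0
    for count, bound in zip(counts, bounds):
        cumulative += count
        if cumulative >= rank:
            return bound
    return bounds[-1]

def counter_rate(earlier, later):
    """Per-second rate from two (timestamp_s, counter_value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    return (v1 - v0) / (t1 - t0)

counts = to_histogram(latencies_ms, BOUNDS_MS)
print(mean(latencies_ms))                                  # 49.8
print(percentile_nearest_rank(latencies_ms, 99))           # 2000
print(histogram_percentile_bound(counts, BOUNDS_MS, 99))   # 2500
print(counter_rate((0, 1000), (60, 1600)))                 # 10.0
```

The mean looks healthy at about 50 ms while the p99 is two seconds. The histogram can only say the p99 is at most 2500 ms, which is the trade-off of bucketing: bucket boundaries set the resolution of every percentile you later compute.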
A label whose values are unbounded, such as user_id or request_id, can explode one metric into millions of series and is the single most common way to blow up an observability bill. Keep label values low-variety; put the high-variety detail in traces and logs, where it belongs.
Traces: one request, end to end
A trace records the journey of a single request across services as a tree of timed spans — one span per unit of work, each with a start, a duration, and a parent. Lined up on a timeline, a trace shows you serial waits, fan-out, and the one slow hop in a way no aggregate can. Tracing is the signal built for microservices, and it is covered in depth in OpenTelemetry & distributed tracing.
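As a data structure, a trace is small. The sketch below is a hypothetical Span type in plain Python, not the OpenTelemetry API, with illustrative timings. It computes each span's self time (its duration minus the time spent in its direct children), which is one simple way to make the slow hop stand out. It assumes the children of a span do not overlap in time; with parallel fan-out you would need to merge intervals instead.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int

def self_times(spans):
    """Time spent in each span excluding its direct children.

    Assumes the children of a span do not overlap each other in time.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}

# Illustrative trace: each service calls the next one in series.
trace = [
    Span("a", None, "gateway", 0, 1000),
    Span("b", "a", "orders", 60, 940),
    Span("c", "b", "billing → charge", 100, 900),
]

times = self_times(trace)
slowest = max(times, key=times.get)
print(times)    # {'gateway': 60, 'orders': 40, 'billing → charge': 900}
print(slowest)  # billing → charge
```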
Because keeping every span is costly, traces are usually sampled, which is its own design choice — you want to keep the interesting (slow or failed) ones, not a random tenth that happens to be all happy paths. The detail to remember here is that a trace is the bridge between the other two signals: a good setup lets you click from a spiking latency metric straight to example traces in that spike, and from a slow span straight to the logs it emitted. That linkage is what turns three separate tools into one investigation.
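Keeping the interesting traces means deciding after a trace has finished, which is usually called tail-based sampling. The sketch below shows only the decision logic, under stated assumptions: the 800 ms threshold, the 10% baseline, and the function name are all hypothetical, and a real deployment would make this decision in a collector that has buffered the whole trace.

```python
import zlib

def keep_trace(trace_id, duration_ms, had_error, slow_ms=800, baseline_percent=10):
    """Decide whether to keep a finished trace.

    Failed or slow traces are always kept. A deterministic share of the
    remaining traces is kept as a baseline of normal behaviour.
    """
    if had_error or duration_ms >= slow_ms:
        return True
    # Hash the id so the same trace always gets the same decision.
    return zlib.crc32(trace_id.encode()) % 100 < baseline_percent

# (trace_id, duration_ms, had_error), with made-up values.
finished = [("7af3", 1000, False), ("19c0", 45, False), ("e2b1", 30, True)]

# With the baseline switched off, only the slow and failed traces survive.
kept = [tid for tid, dur, err in finished if keep_trace(tid, dur, err, baseline_percent=0)]
print(kept)  # ['7af3', 'e2b1']
```

Hashing the trace id instead of calling a random number generator is a design choice: it makes the decision reproducible, so two components looking at the same trace agree on whether it was kept.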
Logs: the detail, when you need it
A log is a timestamped record of a discrete event. The single most valuable upgrade you can make to logging is to emit structured logs (key-value fields, usually JSON) rather than free-text strings. A line like level=error event=payment_failed order=8675309 reason=pool_timeout latency_ms=900 can be filtered, grouped, and aggregated; the same information baked into an English sentence can only be grepped. Structured logs are the difference between "find every payment failure caused by a pool timeout in the last hour" being a query versus a manual slog.
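A minimal sketch of structured logging with Python's standard logging module follows. The JsonFormatter class and the convention of passing fields through extra={"fields": ...} are choices made for this example, not a standard; dedicated structured-logging libraries offer the same idea with less ceremony. The last lines show the "query" from the paragraph above as a plain filter over parsed events.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname.lower(), "event": record.getMessage()}
        # Extra fields arrive via extra={"fields": {...}}, a convention chosen here.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("billing_demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

log.error("payment_failed",
          extra={"fields": {"order": 8675309, "reason": "pool_timeout", "latency_ms": 900}})
log.info("payment_ok", extra={"fields": {"order": 8675310, "latency_ms": 42}})

# The "query": every payment failure caused by a pool timeout.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
pool_timeouts = [
    e for e in events
    if e["event"] == "payment_failed" and e.get("reason") == "pool_timeout"
]
print(pool_timeouts)
```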
Logs carry the richest detail — the exact error, the offending input, the stack — which is what you want once a metric and a trace have pointed you at the right service and the right moment. The flip side is volume: logs are the most expensive signal to store and search at scale, so mature teams are disciplined about levels (debug in development, info and above in production), sample chatty success paths while keeping all errors, and resist the urge to log everything "just in case." A useful habit is to attach the current trace and request id to every log line, so a single click takes you from a slow trace to the exact lines it produced.
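One way to attach the trace id to every line without threading it through each call is a logging filter that reads it from request-scoped context. The sketch below uses only the standard library; the current_trace_id variable and handle_request function are hypothetical stand-ins for whatever your tracing library and request handler provide. It also shows the level discipline described above: the logger is set to info, so the debug line is never emitted.

```python
import contextvars
import io
import logging

# Hypothetical per-request context; a tracing library would normally own this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Copy the active trace id onto every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drops a record, only annotates it

out = io.StringIO()  # stands in for the real log destination
handler = logging.StreamHandler(out)
handler.setFormatter(
    logging.Formatter("level=%(levelname)s trace_id=%(trace_id)s event=%(message)s")
)
handler.addFilter(TraceIdFilter())

log = logging.getLogger("correlation_demo")
log.setLevel(logging.INFO)  # info and above, as in production
log.addHandler(handler)
log.propagate = False

def handle_request(trace_id):
    token = current_trace_id.set(trace_id)
    try:
        log.debug("cache_lookup")  # below the INFO threshold, so not emitted
        log.error("pool_timeout")
    finally:
        current_trace_id.reset(token)  # do not leak the id into the next request

handle_request("7af3")
print(out.getvalue())  # level=ERROR trace_id=7af3 event=pool_timeout
```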
How they compare
| Signal | Answers | Granularity | Cost at scale | Watch out for |
|---|---|---|---|---|
| Metrics | Is it wrong? When? | Aggregate | Low | Cardinality blowups |
| Traces | Where in the path? | Per request | Medium (sampled) | Sampling out the interesting ones |
| Logs | Why exactly? | Per event | High | Volume, unstructured text |
The table also explains the cost order, which is worth committing to memory because it should shape instinct: metrics are cheap, traces are moderate when sampled, logs are expensive. A good rule is to push as much of your "what is happening" understanding into metrics and traces as you can, and treat logs as the place you go for the last mile of detail, not the first.
Wiring them together: correlation and exemplars
The three signals are only as good as the links between them. Two mechanisms do the wiring. Correlation ids — a trace id and request id stamped consistently into spans and log lines — let you pivot from one signal to another for the same request. Without them, you are reduced to matching on timestamps, which is hopeless under load. Exemplars go a step further: they attach a sample trace id directly to a metric data point, so when you see a latency histogram bucket light up, you can jump straight to an actual trace that landed in that bucket. The dream workflow — alert fires, click into the spike, land on a representative slow trace, click into its logs — only works if this plumbing is in place from the start.
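The exemplar idea can be sketched as a histogram whose buckets remember one trace id each. This is a toy model to show the mechanism, not the API of any metrics library; the class names, bucket bounds, and trace ids are assumptions. Real implementations also decide which exemplar to keep and when to expire it, which this sketch reduces to "keep the most recent".

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional

@dataclass
class Bucket:
    upper_ms: float
    count: int = 0
    exemplar_trace_id: Optional[str] = None  # one sample trace that landed here

class LatencyHistogram:
    def __init__(self, bounds_ms):
        # A final catch-all bucket holds anything above the largest bound.
        self.bounds = list(bounds_ms) + [float("inf")]
        self.buckets = [Bucket(b) for b in self.bounds]

    def _bucket_for(self, latency_ms):
        return self.buckets[bisect_left(self.bounds, latency_ms)]

    def observe(self, latency_ms, trace_id=None):
        bucket = self._bucket_for(latency_ms)
        bucket.count += 1
        if trace_id is not None:
            bucket.exemplar_trace_id = trace_id  # keep the most recent one

    def exemplar_for(self, latency_ms):
        """Trace id to jump to from the bucket that covers this latency."""
        return self._bucket_for(latency_ms).exemplar_trace_id

hist = LatencyHistogram([100, 250, 500, 1000])
hist.observe(42, trace_id="19c0")
hist.observe(1000, trace_id="7af3")

# Clicking the 500-1000 ms bucket lands on a real slow trace.
print(hist.exemplar_for(900))  # 7af3
```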
Is "three pillars" the whole story?
A worthwhile nuance: some practitioners argue the "three pillars" framing encourages three disconnected silos, each paid for separately, when what you actually want is one richly-annotated record per unit of work that you can aggregate, trace, and read as needed. That idea — sometimes called wide structured events or "observability 2.0" — stores a fat event per request with dozens of high-cardinality fields, and derives metrics and traces from it rather than emitting three separate streams. You do not have to adopt that to benefit from the insight behind it: the value is in high-cardinality, correlated data, and the three signals are implementation details in service of being able to ask arbitrary questions later. Treat the pillars as a starting mental model, not a mandate to build three unconnected pipelines.
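To make the wide-event idea concrete, the sketch below keeps one made-up event per request and derives an error-rate metric from it after the fact, grouped by whichever fields the question needs. The field names and values are illustrative assumptions, and a real system would run this as a query in an event store rather than a loop in memory.

```python
from collections import defaultdict

# One made-up wide event per request; the field names are illustrative.
events = [
    {"trace_id": "a1", "region": "eu", "platform": "android", "logged_in": True,
     "duration_ms": 2100, "error": True},
    {"trace_id": "a2", "region": "eu", "platform": "ios", "logged_in": True,
     "duration_ms": 80, "error": False},
    {"trace_id": "a3", "region": "us", "platform": "android", "logged_in": False,
     "duration_ms": 95, "error": False},
    {"trace_id": "a4", "region": "eu", "platform": "android", "logged_in": True,
     "duration_ms": 1900, "error": True},
]

def error_rate_by(events, *fields):
    """Derive an error-rate metric grouped by any combination of fields."""
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        key = tuple(e[f] for f in fields)
        totals[key] += 1
        errors[key] += e["error"]  # True counts as 1, False as 0
    return {key: errors[key] / totals[key] for key in totals}

# A question nobody planned a dashboard for, answered from data already emitted.
by_segment = error_rate_by(events, "region", "platform", "logged_in")
print(by_segment[("eu", "android", True)])  # 1.0
print(by_segment[("eu", "ios", True)])      # 0.0
```

Grouping by region alone would show a raised error rate in eu; adding platform and logged_in isolates the failing segment. A pre-aggregated metric could only do that if those labels had been chosen in advance.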
A worked incident
Put it together with one walk-through. At 12:01 an alert fires: checkout p99 latency crossed 800 ms (a histogram metric with an exemplar attached). You click the spike and land on trace 7af3, which shows the request spent 60 ms in the gateway, 40 ms in orders, and then sat for 900 ms inside a span named "billing → charge." The trace has localised the problem to one service and one operation, something no dashboard of aggregate metrics could have told you. You follow the trace id into the billing service's structured logs and find a cluster of event=pool_timeout lines starting at 12:00:58, each noting the connection pool was exhausted. A glance at the database's own metrics shows a failover began at 12:00:57. Root cause in minutes: the database failover briefly stalled connections, the pool drained, and billing calls queued. Metrics told you something was wrong and when; the trace told you where; the logs told you why; the correlation ids made the whole chain one click instead of three investigations.
Pitfalls worth avoiding
A few mistakes recur. Alerting on averages hides the tail; alert on percentiles from histograms instead. High-cardinality labels on metrics quietly multiply cost until a bill or a memory limit forces a reckoning — keep that detail in traces and logs. Unstructured logs turn every investigation into grep archaeology; emit key-value fields. Random trace sampling throws away the slow and failed traces you most need; prefer keeping the interesting ones. And missing correlation ids leave you matching signals by timestamp under load, which does not work. None of these are exotic — they are the difference between observability that earns its cost and a pile of telemetry you cannot actually use.
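The cardinality pitfall is the easiest of these to quantify. In the worst case, the number of time series behind one metric is the product of the number of distinct values of each label. The sketch below works that out for a hypothetical label set; apart from the eight services used earlier as an example, the counts are made up, and the result is an upper bound because not every combination necessarily occurs.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical labels with a bounded number of values each.
bounded = {"service": 8, "status_class": 5, "region": 4}
print(series_count(bounded))                             # 160

# Adding a single unbounded label multiplies everything that came before it.
print(series_count({**bounded, "user_id": 1_000_000}))   # 160000000
```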
Further reading
- OpenTelemetry — Signals — the vendor-neutral definitions of metrics, traces, and logs and how they relate.
- Honeycomb blog — on high-cardinality observability — the clearest argument for why per-request, high-cardinality data matters and where the "three pillars" framing falls short.
- Prometheus — metric types — counters, gauges, histograms, and summaries, with the rationale for each.
- Semicolony — the RED method — rate, errors, duration: which metrics to put on a service dashboard in the first place.
- Semicolony — OpenTelemetry & tracing — how the trace that anchors an investigation is actually produced.