OpenTelemetry & distributed tracing

In a monolith a stack trace tells you the whole story. Split that monolith into twenty services and the story is now scattered across twenty machines, each with its own clock and its own logs. Distributed tracing stitches it back together: one id follows the request everywhere it goes, and every service it touches records its slice of the timeline against that id.

The problem tracing solves

Start with the failure mode tracing exists to fix. A user clicks "place order" and waits four seconds for a page that usually loads in half a second. In a monolith you would open a profiler or read one log file and watch the call go down the stack and back up; the whole story lives in one process on one machine with one clock. Split that same flow across a gateway, an auth service, an orders service, a billing service, an inventory service, and three databases, and the story shatters. Each service logged its piece, but the pieces have different request ids, different log formats, and clocks that drift by tens of milliseconds. Reassembling "what happened to that one request" from those scattered logs is an archaeology project, and you have to do it while the page is on fire.

Distributed tracing fixes this by deciding, at the very first hop, on a single id for the request and carrying it everywhere the request goes. Every service that touches the request records its slice of work against that same id. Collect all those slices afterwards and you can rebuild the request's entire journey on one timeline: who called whom, what ran in parallel, which call was slow, and where it failed. The four-second order becomes a picture, and the picture usually answers the question in seconds. The companion page on logs, metrics and traces frames the three signals together; this page is the deep dive on the one that answers "where did the time go, and which hop broke."

Spans: the unit everything is built from

A trace is a tree of spans, and a span is one timed unit of work. An incoming HTTP request is a span. A database query is a span. An outbound call to another service is a span. A chunk of CPU work you decided was worth measuring is a span. Each one records a small, fixed set of fields, and once you know those fields you understand the whole data model.

Trace id — a 16-byte id shared by every span in the same request. This is the thread that ties the whole tree together.
Span id — an 8-byte id unique to this one span.
Parent span id — the span id of whatever caused this work. A root span (the very first one) has no parent.
Name — a low-cardinality label like GET /orders/:id or SELECT orders, chosen so similar operations group together.
Start time and duration — when the work began and how long it took. Duration is the number you stare at most.
Attributes — key/value pairs that describe this span: http.method, http.status_code, db.system, user.tier, and so on.
Events — timestamped points inside the span, like "cache miss" or "retry 2", for things that happen but do not deserve a child span.
Status — ok, error, or unset, so a backend can find the failed spans without parsing attributes.
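The field list above maps naturally onto a small record type. The following is a minimal sketch, not an SDK class: the `Span` dataclass, `new_trace_id`, and the example names are hypothetical, and ids are stored hex-encoded (32 characters for the 16-byte trace id, 16 for the 8-byte span id).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record mirroring the fields listed above."""
    name: str                              # low-cardinality label
    trace_id: str                          # 16 bytes, hex-encoded, shared by the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))  # 8 bytes
    parent_span_id: Optional[str] = None   # None marks a root span
    start_ns: int = field(default_factory=time.time_ns)
    duration_ns: int = 0
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (timestamp_ns, name) pairs
    status: str = "unset"                  # "unset", "ok", or "error"

    def add_event(self, name: str) -> None:
        """Record a timestamped point inside this span."""
        self.events.append((time.time_ns(), name))

    def child(self, name: str) -> "Span":
        """Start a child: same trace id, this span as the parent."""
        return Span(name=name, trace_id=self.trace_id, parent_span_id=self.span_id)


def new_trace_id() -> str:
    """Mint a fresh 16-byte trace id as 32 hex characters."""
    return secrets.token_hex(16)


root = Span(name="GET /orders/:id", trace_id=new_trace_id())
query = root.child("SELECT orders")
query.attributes["db.system"] = "postgresql"
query.add_event("cache miss")
```

The only link between `root` and `query` is the pair of ids: the shared trace id says they belong to the same request, and the parent span id says which one caused the other.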

The parent pointers are what turn a flat pile of spans into a tree. Each span knows its own id and its parent's id, so a backend that receives a few thousand spans with the same trace id can reconstruct the exact shape of the call graph by matching children to parents. Order the children by start time and lay them out against a clock and you get a waterfall, which is the view you will live in.
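As a rough illustration of that reconstruction, the sketch below rebuilds the tree from spans that arrive in arbitrary order. The dict layout, the function names, and the sample span names are assumptions made for the example, not any backend's actual storage format.

```python
from collections import defaultdict


def build_tree(spans):
    """Match children to parents; return (roots, children_by_parent_id)."""
    known_ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        parent = s["parent_span_id"]
        if parent is None or parent not in known_ids:
            # A true root, or an orphan whose parent span never arrived.
            roots.append(s)
        else:
            children[parent].append(s)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_ms"])
    roots.sort(key=lambda s: s["start_ms"])
    return roots, children


def render(spans):
    """Return one indented line per span, depth-first, siblings by start time."""
    roots, children = build_tree(spans)
    lines = []

    def walk(span, depth):
        lines.append("  " * depth + span["name"])
        for child in children[span["span_id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Spans deliberately listed out of order, as a backend might receive them.
received = [
    {"span_id": "c", "parent_span_id": "b", "name": "SELECT orders", "start_ms": 30},
    {"span_id": "a", "parent_span_id": None, "name": "POST /checkout", "start_ms": 0},
    {"span_id": "d", "parent_span_id": "b", "name": "auth.verify", "start_ms": 12},
    {"span_id": "b", "parent_span_id": "a", "name": "orders.create", "start_ms": 10},
]
outline = render(received)
```

Treating a span with an unknown parent as a root is one possible choice; it makes orphans visible instead of silently dropping them.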

The same trace two ways: the parent/child tree on the left, the time waterfall on the right. Nesting in the waterfall means "ran inside"; the long bar at the bottom is the real culprit.

Reading a waterfall is a skill worth naming. A bar that sits almost entirely inside its parent's bar means the parent spent that time waiting on the child. Two sibling bars that overlap in time ran in parallel; two that sit end to end ran in sequence and may be a place to parallelise. A parent whose bar is much wider than the sum of its children has unaccounted time — work it did itself, or a gap where it was blocked on something nobody instrumented. The single most common "aha" is finding one leaf span, often a database call or a downstream service, that owns most of the parent's duration. That is the thing to fix.
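The "unaccounted time" reading can be made concrete with a little interval arithmetic. This is a sketch under assumed inputs: spans are plain dicts with `start_ms` and `duration_ms`, and `self_time_ms` is a hypothetical helper, not a feature of any particular tracing UI.

```python
def self_time_ms(span, children):
    """Time inside `span` not covered by any child (overlaps counted once)."""
    start = span["start_ms"]
    end = start + span["duration_ms"]
    # Clip each child to the parent's window, then sweep left to right.
    intervals = sorted(
        (max(c["start_ms"], start), min(c["start_ms"] + c["duration_ms"], end))
        for c in children
    )
    covered = 0
    cursor = start
    for lo, hi in intervals:
        lo = max(lo, cursor)      # skip the part already counted
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return span["duration_ms"] - covered


def dominant_child(span, children):
    """Return the child that owns the largest share of the parent's duration."""
    return max(children, key=lambda c: c["duration_ms"], default=None)


# Hypothetical numbers: two sequential children inside one parent.
parent = {"name": "orders.create", "start_ms": 0, "duration_ms": 410}
auth = {"name": "auth.verify", "start_ms": 10, "duration_ms": 22}
billing = {"name": "billing.charge", "start_ms": 40, "duration_ms": 350}

unaccounted = self_time_ms(parent, [auth, billing])
culprit = dominant_child(parent, [auth, billing])
```

Counting overlapping children once matters: two parallel children that each take 100ms cover at most 100ms of the parent, not 200ms.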

Context propagation: the part that breaks

Everything above assumes every service tags its spans with the same trace id and the right parent span id. The mechanism that makes that happen across process and network boundaries is context propagation, and it is where most real tracing problems live. When service A calls service B, A has to hand B two things: the trace id (so B's spans join the same trace) and A's current span id (so B's first span knows its parent). B reads them, starts its work as a child, and when B calls C it passes the trace id and its own current span id along in turn. The trace grows hop by hop because each hop forwards the baton.

Over HTTP that baton is a header. The W3C Trace Context standard defines traceparent, a single line with four hyphen-separated fields:

The traceparent header: version, the 16-byte trace id, the 8-byte parent span id, and a flags byte (the low bit is "sampled"). B reads it and continues the trace.

The flags byte matters more than it looks. Its low bit is the sampled flag, the upstream's decision about whether this trace will be recorded. When propagation works, that one bit travels with the request so every service makes the same keep-or-drop choice and you never get a half-recorded trace. There is a companion header, tracestate, that carries vendor-specific key/values, and a separate concept, baggage, for application data you want every downstream service to see — more on that below.
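To make the four fields and the sampled bit concrete, here is a minimal sketch of reading and writing the header. It handles only the four-field version 00 layout described above; `parse_traceparent` and `format_traceparent` are hypothetical helper names, and a real integration would normally leave this to the instrumentation library.

```python
import re
from typing import NamedTuple, Optional

# version - 32 hex chars trace id - 16 hex chars parent span id - 2 hex chars flags
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        """The low bit of the flags byte is the sampled flag."""
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is not usable."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero ids and version "ff" are treated as invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, span_id, int(flags, 16))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service would send on its outbound call."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# The next hop keeps the trace id and the flags but substitutes its own span id.
outgoing = format_traceparent(incoming.trace_id, "b7ad6b7169203331", incoming.sampled)
```

Note what changes between `incoming` and `outgoing`: only the span id. The trace id and the sampled decision are carried through untouched.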

A broken trace is almost always broken propagation. If a trace stops dead at a service boundary, suspect a hop that failed to forward traceparent: an un-instrumented HTTP client, a reverse proxy that strips unknown headers, a message queue that does not carry headers on the message, or a thread/goroutine hand-off that lost the in-process context. The fix is almost never in the tracing backend; it is at the boundary where the baton was dropped.

In-process propagation is its own quiet trap. Within a single service, the "current span" is held in a context object — Go's context.Context, a thread-local or async-local in most other runtimes. If you spawn a background task, hand work to a thread pool, or fire an async callback without carrying that context across, the child work starts with no parent and either lands as an orphan trace or vanishes. Asynchronous code and queues are where most home-grown propagation quietly fails, which is the main reason to use the official instrumentation rather than rolling your own header plumbing.
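The thread-pool case can be sketched with Python's standard `contextvars` module. The `current_span_id` variable and the helper names below are stand-ins invented for this example; a real SDK manages its own context slot, and its instrumentation usually does this hand-off for you.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the runtime's "current span" slot.
current_span_id = contextvars.ContextVar("current_span_id", default=None)


def record_child_work():
    """Pretend to start a child span: report which parent it would attach to."""
    return current_span_id.get()


def submit_with_context(pool, fn, *args):
    """Capture the caller's context and run `fn` inside it on the worker thread."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


current_span_id.set("00f067aa0ba902b7")

with ThreadPoolExecutor(max_workers=1) as pool:
    # Plain submit: the worker does not run in the caller's context, so on a
    # standard CPython build the parent is typically missing here.
    naive_parent = pool.submit(record_child_work).result()
    # Explicit hand-off: the worker sees the same current span as the caller.
    carried_parent = submit_with_context(pool, record_child_work).result()
```

The difference between the two calls is one captured context object, which is why this bug is so easy to introduce and so hard to see in review.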

What OpenTelemetry actually is

OpenTelemetry, usually written OTel, is the vendor-neutral standard for producing telemetry — traces, metrics, and logs. It won the space because it separates how you instrument from where the data ends up, which means the tracing code you write does not marry you to a single vendor. The standard has three parts that are worth holding apart in your head, because confusing them is the source of a lot of fuzzy thinking.

The API — the surface your application code calls. It defines what a span is, how you start and end one, how you set attributes. It deliberately does almost nothing on its own: if no SDK is wired in, API calls are cheap no-ops, so a library can depend on the API without forcing a tracing setup on its users.
The SDK — the implementation you install at the edge of your app. It does the real work: deciding sampling, batching spans, attaching resource attributes (service name, version, host), and handing finished spans to an exporter.
The Collector — a standalone process, separate from your app, that receives telemetry, processes it (filtering, batching, redaction, tail sampling), and exports it onward to one or more backends.

The API/SDK/Collector split. Your code calls the API, and the SDK speaks OTLP to the Collector; switching backends is a config change in the Collector, not a code change in the app.
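The "cheap no-op until an SDK is wired in" idea behind the API/SDK split can be shown in miniature. Everything below is a toy model with hypothetical class and function names; it is not the OpenTelemetry API, only the shape of the separation.

```python
class NoopSpan:
    """What the bare API hands out: accepts calls, records nothing."""
    def set_attribute(self, key, value):
        pass

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


class NoopTracer:
    def start_span(self, name):
        return NoopSpan()


_tracer = NoopTracer()  # default when no SDK has been installed


def get_tracer():
    """API side: library code only ever calls this."""
    return _tracer


def set_tracer(tracer):
    """SDK side: the application installs a real implementation at startup."""
    global _tracer
    _tracer = tracer


class RecordingSpan:
    def __init__(self, name, sink):
        self.record = {"name": name, "attributes": {}}
        self._sink = sink

    def set_attribute(self, key, value):
        self.record["attributes"][key] = value

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._sink.append(self.record)  # stand-in for handing off to an exporter
        return False


class RecordingTracer:
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


def price_cart(items):
    """Library code: depends on the API only, works with or without an SDK."""
    with get_tracer().start_span("pricing.evaluate") as span:
        span.set_attribute("cart.items", len(items))
        return sum(items)
```

`price_cart` never changes; whether its span is recorded depends entirely on what the application installed with `set_tracer`.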

The wire format between all of these is OTLP, the OpenTelemetry Protocol. Your SDK exports OTLP to the Collector; the Collector can export OTLP again, or translate to whatever a particular backend wants. The payoff is real portability. Instrument once against the OTel API, run the Collector, and you can move from one tracing backend to another, or fan out to several at once, by editing Collector config and restarting one process. None of your application code changes, which is exactly what you want when a vendor contract expires or a cost review forces a switch.

A fair question is whether you even need the Collector. You can export straight from the SDK to a backend, and for a tiny system that is fine. The Collector earns its place once you have many services: it gives you one spot to do tail sampling (which needs a full view of each trace), to scrub PII before it leaves your network, to add or rename attributes consistently, to buffer during backend outages, and to shield your apps from knowing any backend's address. It is the seam that keeps the rest of the system loosely coupled.

Instrumentation: automatic and by hand

Spans do not appear by magic; something has to call the API at the start and end of each unit of work. That "something" comes in two flavours, and a healthy system uses both.

Automatic instrumentation is the fast win. Language-specific agents and library integrations wrap the framework code you already use — the web framework's request handler, the HTTP client, the database driver, the message-queue client — so that incoming requests, outbound calls, and queries become spans without you writing tracing code. These integrations also handle propagation: the HTTP client integration injects traceparent into outbound requests, and the server integration extracts it from incoming ones. In Java you can often attach an agent at startup and get a useful trace on the first request. This is where you should start, because it gets propagation right at every standard boundary, which is the part that is easy to get wrong by hand.

Manual instrumentation is how you add the business meaning that the framework cannot know. Auto-instrumentation will tell you a request spent 200ms in a handler; it will not tell you that 150ms of it was your pricing-rules evaluation, because "pricing rules" is your concept, not the framework's. So you wrap the parts that matter in your own spans, give them clear names, and attach attributes that turn a trace into a debugging tool: the customer tier, the feature flag in effect, the number of items in the cart, the cache hit or miss. Good manual spans are what let you ask "are slow traces correlated with enterprise customers?" months later. The rule of thumb is: let auto-instrumentation cover the plumbing, and add manual spans and attributes for the decisions and the domain logic you will want to slice by.
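A manual span around domain logic might look roughly like the sketch below. The `span` context manager, the `finished_spans` list, and the pricing function are all hypothetical and single-threaded; with a real SDK you would use its tracer instead, but the shape (wrap the decision, name it, attach the attributes you will slice by) is the same.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter
_stack = []          # in-process parent tracking (single-threaded sketch)


@contextmanager
def span(name, **attributes):
    """Time a block of work and record it with its attributes and parent."""
    record = {
        "name": name,
        "parent": _stack[-1]["name"] if _stack else None,
        "attributes": dict(attributes),
        "status": "unset",
    }
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _stack.pop()
        finished_spans.append(record)


def evaluate_pricing_rules(cart, tier):
    """Domain logic the framework cannot see, wrapped in its own span."""
    attrs = {"customer.tier": tier, "cart.items": len(cart)}
    with span("pricing.evaluate_rules", **attrs) as current:
        discount = 0.10 if tier == "enterprise" else 0.0
        current["attributes"]["pricing.discount_applied"] = discount > 0
        return round(sum(cart) * (1 - discount), 2)


def handle_checkout(cart, tier):
    # In practice auto-instrumentation would create this outer span.
    with span("POST /checkout"):
        return evaluate_pricing_rules(cart, tier)
```

The span name stays generic while the tier and cart size go into attributes, which is what later allows a question like "are slow traces correlated with enterprise customers?" to be answered.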

Attribute cardinality is a real cost. Attributes like user id or full URL with query string have effectively unbounded distinct values. They are fine on spans (you query them ad hoc), but if you also turn spans into metrics — see span metrics below — every distinct value becomes a new time series, and high-cardinality labels are how you accidentally melt a metrics backend. Keep span names low-cardinality; push the specifics into attributes.

Baggage: carrying data with the request

Trace context answers "which trace and which parent." Baggage answers a different need: arbitrary key/value data you want every service downstream to be able to read, carried alongside the trace context in its own header. Set customer.tier = enterprise at the gateway, and the billing service three hops later can read it and tag its own spans with it, even though billing never received that value as a normal request parameter. It is a way to thread context through a call chain without changing every intermediate service's API.

Two cautions keep baggage useful rather than dangerous. First, it is sent on every hop as a header, so a fat baggage payload taxes every request; keep it small. Second, baggage crosses trust boundaries with the request, so never put secrets or anything sensitive in it, and be careful about accepting baggage from outside your perimeter. Used with restraint — a tier flag, a tenant id, a canary marker — it is a clean way to make a slice of business context available everywhere a request travels.
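Both cautions can be partly enforced in code. The sketch below encodes and decodes a comma-separated, percent-encoded key=value header in the style of the W3C baggage header; the helper names are hypothetical, and the 512-byte budget is an arbitrary limit chosen for this example, not a value taken from any specification.

```python
from urllib.parse import quote, unquote

MAX_BAGGAGE_BYTES = 512  # illustrative budget for this sketch only


def encode_baggage(entries):
    """Serialise a small dict into a single header value."""
    header = ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )
    if len(header.encode("utf-8")) > MAX_BAGGAGE_BYTES:
        raise ValueError("baggage too large; it is sent on every hop")
    return header


def decode_baggage(header):
    """Parse the header back into a dict, ignoring malformed members."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # drop optional properties after ';'
        entries[key.strip()] = unquote(value.strip())
    return entries


# Set at the gateway, read three hops later by billing.
outbound = encode_baggage({"customer.tier": "enterprise", "tenant.name": "acme co"})
seen_by_billing = decode_baggage(outbound)
```

A size check at the point of writing is a cheap guard against fat payloads. Deciding which inbound keys to trust at the perimeter is a separate policy that this sketch does not attempt.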

Sampling: you cannot keep every trace

A system doing tens of thousands of requests a second produces an absurd volume of spans. Storing all of them is wasteful and expensive, and the overwhelming majority are identical boring successes that tell you nothing. So you sample — keep some traces, drop the rest. The interesting design question is when you decide, and the two answers have opposite strengths.

Strategy	Decides…	Trade-off
Head sampling	At the first hop, before the outcome is known. The decision rides in the traceparent flags so the whole trace agrees.	Cheap, simple, no buffering. But it is blind to outcome, so it drops the slow and failed traces you most wanted to keep.
Tail sampling	After the trace finishes, when its duration and status are known. Done in the Collector.	Keeps the traces that matter — errors, slow ones, rare paths. But the Collector must buffer all spans of a trace until it is complete, which costs memory and needs every span to reach the same Collector.

Head sampling commits before it knows how the request went, so it can throw away the exact slow trace you needed. Tail sampling waits for the verdict.
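One way to make a head decision cheap and consistent is to derive it from the trace id itself, so the same id always yields the same answer. The sketch below is in the spirit of ratio-based samplers, but the exact arithmetic and the function names are assumptions made for illustration.

```python
from typing import Optional


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deterministically per trace id."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    # Treat the low 8 bytes of the trace id as a number in [0, 2**64).
    value = int(trace_id_hex[-16:], 16)
    return value < int(ratio * (1 << 64))


def decide(incoming_flags: Optional[int], trace_id_hex: str, ratio: float) -> bool:
    """Respect the upstream decision if there is one; otherwise decide here."""
    if incoming_flags is not None:
        return bool(incoming_flags & 0x01)  # the sampled bit from traceparent
    return head_sample(trace_id_hex, ratio)
```

Nothing in `decide` looks at duration or status, which is the whole limitation: the choice is made before either exists.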

A common production setup is tail sampling in the Collector with a few policies layered: keep 100% of traces that errored, keep 100% that exceeded a latency threshold, keep a low percentage of plain successes for a baseline, and maybe always keep traces touching a specific tenant during a launch. That gives you the traces that matter during an incident without paying to store millions of identical happy paths. The price is operational: tail sampling needs all of a trace's spans to land at the same Collector instance, which constrains how you route and scale Collectors, and it needs enough memory to buffer in-flight traces for their lifetime.
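The layered policy described above reduces to a short decision function once the whole trace is in hand. This is a sketch of the logic only: the 400ms threshold, the 1% baseline, the `tenant.id` attribute key, and the function name are all hypothetical, and a real Collector expresses these as configuration instead of code.

```python
import random


def keep_trace(spans, latency_threshold_ms=400, baseline_ratio=0.01,
               always_keep_tenants=frozenset(), rng=random.random):
    """Return (keep, reason) for a complete trace given as a list of span dicts."""
    # Policy 1: keep every trace that contains an error.
    if any(s.get("status") == "error" for s in spans):
        return True, "error"
    # Policy 2: keep every trace slower than the threshold, end to end.
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > latency_threshold_ms:
        return True, "slow"
    # Policy 3: always keep traces touching a tenant under watch.
    if any(s.get("attributes", {}).get("tenant.id") in always_keep_tenants
           for s in spans):
        return True, "tenant"
    # Policy 4: a small random baseline of ordinary successes.
    if rng() < baseline_ratio:
        return True, "baseline"
    return False, "dropped"


slow_trace = [
    {"start_ms": 0, "duration_ms": 410, "status": "ok"},
    {"start_ms": 40, "duration_ms": 350, "status": "ok"},
]
decision = keep_trace(slow_trace)
```

The function needs every span of the trace as input, which is the operational cost in miniature: something has to hold those spans, in one place, until the trace is complete.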

The classic pitfall is a mistuned sampling rate, and it cuts in both directions. Drop too aggressively and, during an incident, the one trace you need has already been thrown away. Keep too much and your storage bill, and sometimes your application's own overhead, balloons. The fix is rarely a single global rate; it is outcome-aware policy: cheap on the boring path, generous on errors, slow requests, and anything you are actively investigating.

Span metrics: counting without re-instrumenting

Spans and metrics feel like separate worlds, but spans already contain everything you need to compute the core service-health metrics. Every span has a name, a duration, and a status, which is exactly request rate, latency, and error rate — the RED signals. The Collector can derive these span metrics automatically: count spans per name for rate, build latency histograms from durations, and count error-status spans for the error rate. You get dashboards and alerts for free off the same instrumentation that produced your traces, and the numbers line up with the traces because they came from the same source.

This is where the cardinality warning above bites hardest. A metric is a separate time series per unique combination of labels, so the dimensions you derive span metrics on must stay low-cardinality — service, operation name, status — not user id or raw URL. The discipline is the same one that keeps span names clean: stable, bounded names for the metric dimensions, rich specifics left as span attributes you query when you actually open a trace.
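Deriving rate, errors, and duration from finished spans is a plain aggregation, sketched below. The bucket bounds, the dict layout, and the `span_metrics` name are assumptions for illustration; the point to notice is that the grouping key contains only bounded dimensions.

```python
from collections import defaultdict

BUCKETS_MS = (50, 100, 250, 500, 1000)  # hypothetical histogram upper bounds


def span_metrics(spans):
    """Aggregate spans into count, error count, and a latency histogram."""
    metrics = defaultdict(
        lambda: {"count": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)}
    )
    for s in spans:
        # Low-cardinality dimensions only: never user id or raw URL here.
        key = (s["service"], s["name"])
        entry = metrics[key]
        entry["count"] += 1
        if s["status"] == "error":
            entry["errors"] += 1
        # First bucket whose bound covers the duration; last slot is overflow.
        index = next(
            (i for i, bound in enumerate(BUCKETS_MS) if s["duration_ms"] <= bound),
            len(BUCKETS_MS),
        )
        entry["buckets"][index] += 1
    return dict(metrics)


observed = [
    {"service": "billing", "name": "SELECT orders", "duration_ms": 200, "status": "ok"},
    {"service": "billing", "name": "SELECT orders", "duration_ms": 30, "status": "error"},
    {"service": "gateway", "name": "POST /checkout", "duration_ms": 410, "status": "ok"},
]
red = span_metrics(observed)
```

Adding a user id to `key` would create one entry per user, which is exactly the series explosion the paragraph above warns about.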

A worked example, start to finish

Walk one request through the whole machine. A user submits an order. The request hits the gateway, whose auto-instrumentation sees no incoming traceparent, so this is a root: it mints a fresh trace id 7af3… and a span id, opens a span named POST /checkout, and makes the head sampling decision. This setup leaves the real keep-or-drop choice to tail sampling in the Collector, so the head decision is simply to record everything and the sampled flag is set. The gateway then calls the orders service.

The gateway's HTTP-client integration injects traceparent: 00-7af3…-<gateway-span>-01 into that outbound call. The orders service extracts the header, opens a span whose parent is the gateway's span and whose trace id is still 7af3…, and does its work. It calls auth (a short child span, 22ms) and then billing, again forwarding traceparent on each hop. Billing opens its span and runs a database query, which the database driver integration wraps in its own child span. That query hits a replica that is mid-failover and stalls for 200ms before succeeding.

Each service's SDK batches its finished spans and exports them over OTLP to the Collector. The Collector buffers all spans sharing trace 7af3… until the trace completes, then applies its tail-sampling policy. Total duration is 410ms, above the "slow" threshold, so the policy keeps the whole trace and forwards it to the tracing backend. The Collector also rolls the same spans into span metrics, bumping the request count and latency histogram for each operation.

Now you open the trace. The waterfall shows the gateway span as the outer bar, orders nested inside it, auth as a short blip, and billing as a wide bar whose width is almost entirely the database child span. The story reads itself off the picture: the request was slow because a single database query waited on a replica failover. No log archaeology, no clock-skew guessing — one trace, one obvious culprit. And because billing's span carried attributes like db.system and the replica host, you can immediately check whether other slow traces share that host, turning a single incident into a pattern.

Common pitfalls, named directly

Most tracing pain falls into a handful of recurring shapes, and naming them makes them easy to spot.

Broken propagation. The trace stops at a boundary because a hop did not forward traceparent. The usual suspects are an un-instrumented HTTP client, a proxy stripping headers, or a queue that drops them. Symptom: orphan traces that start mid-system.
Lost context across async and queues. A background job, thread-pool task, or message consumer starts without the parent context, so its work either disappears or becomes a disconnected trace. Message queues need explicit propagation: inject the context into the message on publish, extract it on consume.
Blunt sampling. A single global sample rate that throws away the slow and failed traces you needed. Make sampling outcome-aware instead.
High-cardinality span names. Putting the order id in the span name rather than an attribute, so every request is its own "operation" and grouping breaks. Keep names like GET /orders/:id; put the id in an attribute.
Clock skew confusion. Each service stamps its own spans with its own clock, so a child can appear to start slightly before its parent, or two spans on different hosts can look mis-ordered by a few milliseconds. Trust the parent/child structure over the raw timestamps when they disagree at the margins.
Missing manual spans. Auto-instrumentation shows the plumbing but leaves your domain logic as one opaque block, so the trace shows "200ms in the handler" with no breakdown. Add spans around the parts you will want to explain.
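Of the pitfalls above, queue propagation is the one most often solved by hand, so it is worth a sketch. The in-memory list standing in for a queue, the message layout, and the `publish`/`consume` helpers are all hypothetical; real message-queue integrations do the same inject-on-publish, extract-on-consume dance against the broker's own header mechanism.

```python
import secrets


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current trace context into the message headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def extract(headers):
    """Read trace context back out; return None if absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _, trace_id, parent_span_id, flags = parts
    try:
        sampled = bool(int(flags, 16) & 0x01)
    except ValueError:
        return None
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "sampled": sampled}


def publish(queue, body, trace_id, current_span_id):
    message = {"body": body, "headers": {}}
    inject(message["headers"], trace_id, current_span_id)
    queue.append(message)


def consume(queue):
    """Start the consumer's span as a child of the publisher's span if possible."""
    message = queue.pop(0)
    ctx = extract(message["headers"])
    return {
        "name": "orders.process",
        # Without context the consumer can only start a disconnected trace.
        "trace_id": ctx["trace_id"] if ctx else secrets.token_hex(16),
        "parent_span_id": ctx["parent_span_id"] if ctx else None,
        "span_id": secrets.token_hex(8),
    }


queue = []
publish(queue, {"order_id": 42}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
consumer_span = consume(queue)
```

The fallback branch in `consume` is what the "orphan trace" symptom looks like from the inside: the work is still recorded, but under a fresh trace id with no parent.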

Why tracing survives a rewrite

Architectures churn constantly. Services split when they get too big, merge when they were split too far, get rewritten in another language for performance, move from VMs to containers to serverless. Through all of it, a request still has to flow through the system, and a trace still describes that flow. The shape of the call graph changes, but the idea of "one id following one request, each hop recording its slice" does not.

Because OpenTelemetry is a standard rather than a single product, the value compounds. The instrumentation you add against the OTel API outlives the backend you happened to choose this year, and even outlives a language change, because every OTel SDK speaks the same data model and the same wire format. You can rewrite a service in Go, keep the same span names and attributes, and the new service slots into existing traces and dashboards as if nothing happened. Of the three signals, tracing is the one that keeps paying off as the system underneath it shifts, which is why it has become the backbone of debugging distributed systems and the first thing worth wiring up on a new one. For the wider picture of how it fits with logs and metrics, the observability index ties the track together.
