eBPF observability

Everything a program does eventually passes through the kernel — every syscall, every packet, every disk read, every scheduler decision. eBPF lets you attach small, verified programs to those points and watch the whole machine from below, without changing a line of application code or loading a kernel module that could take the box down. That is why it went from a packet-filter curiosity to the substrate under half the modern observability industry.


What eBPF actually is

eBPF is a small virtual machine inside the Linux kernel that runs event-driven programs you load from userspace. You write a program (usually in restricted C, compiled to eBPF bytecode), attach it to a hook — a function entry, a tracepoint, a network device — and from then on the kernel runs your program every time that event fires, JIT-compiled to native code. The name is historical: BPF was the 1992 Berkeley Packet Filter that powered tcpdump; "extended" BPF, merged from 2014 onward, generalised it from filtering packets to running programs on nearly any kernel event. The packet-filter heritage is now trivia — eBPF today is closer to "a safe scripting engine for the kernel."

The comparison that makes the design click: kernel modules can also see everything, but a bug in a module is a bug in the kernel — one bad pointer and the machine panics. eBPF's wager is that if a program can be proven harmless before it runs, you can grant it kernel-side visibility without kernel-side blast radius. That proof is the verifier's job, and it is the single load-bearing idea in the whole system.

The verifier: why this is allowed in production

Before the kernel accepts an eBPF program, the verifier statically analyses every path through it and rejects anything it cannot prove safe. The guarantees are concrete: the program must terminate (loops were banned outright until the verifier learned to prove their bounds); every memory access must be provably in range; kernel memory is read only through helper functions or through pointers whose types the verifier has checked; arbitrary kernel writes are forbidden; and the program runs without blocking, on a small fixed stack of 512 bytes. Fail any check and the load fails: the program simply never runs.
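The shape of that check can be sketched in a few lines. What follows is a toy model in Python, not the kernel's verifier: the instruction format, the `verify` function, and the rejection messages are all invented for illustration. It enforces two of the rules above on a made-up instruction set: no backward jumps (the simplest way to guarantee termination) and stack accesses that stay inside a 512-byte stack.

```python
STACK_SIZE = 512  # eBPF programs get a small fixed stack


def verify(program):
    """Toy static check over a hypothetical instruction list.

    Each instruction is a tuple:
      ("store", offset, size)  write `size` bytes at stack offset `offset`
      ("load", offset, size)   read `size` bytes at stack offset `offset`
      ("jump", target)         jump to instruction index `target`
      ("exit",)                end of program
    Returns (True, "ok") or (False, reason). Nothing is ever executed.
    """
    if not program or program[-1] != ("exit",):
        return False, "program must end with exit"
    for pc, insn in enumerate(program):
        op = insn[0]
        if op in ("store", "load"):
            _, offset, size = insn
            if size <= 0 or offset < 0 or offset + size > STACK_SIZE:
                return False, f"insn {pc}: stack access out of range"
        elif op == "jump":
            target = insn[1]
            if target <= pc:
                # A backward edge could loop forever, so reject it outright.
                return False, f"insn {pc}: backward jump"
            if target >= len(program):
                return False, f"insn {pc}: jump out of program"
        elif op != "exit":
            return False, f"insn {pc}: unknown instruction"
    return True, "ok"


print(verify([("store", 0, 8), ("jump", 3), ("load", 8, 8), ("exit",)]))
print(verify([("store", 508, 8), ("exit",)]))
print(verify([("store", 0, 8), ("jump", 0), ("exit",)]))
```

The point to take away is that rejection happens before anything runs. The real verifier is far more involved (it tracks the possible contents of every register along every path), but the contract is the same: prove it safe or refuse the load.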

This is a different trust model from everything else that runs on the box. Your application is trusted because you wrote it; a kernel module is trusted because you really hope it is correct. An eBPF program is machine-checked on every load. The practical consequence is the one that matters for this page: an eBPF observability agent can crash itself, but it cannot crash the kernel, scribble over kernel memory, or wedge a core in an infinite loop. That is why operators who would never load a vendor's kernel module will run a vendor's eBPF agent.

The honest caveat. "Cannot crash the kernel" is not "free." A verified program still costs cycles every time its event fires, and verifier bugs have existed (they are treated as serious kernel CVEs, which is also why unprivileged eBPF is disabled almost everywhere). Safe means "won't take the box down," not "attach to anything without thinking."

Maps: how data gets out, and why the overhead stays low

An eBPF program runs in the kernel, but you are sitting in userspace wanting numbers. The bridge is maps — kernel-resident data structures both sides can read and write: hash maps, arrays, per-CPU variants, ring buffers for streaming events, and specialised ones like stack-trace maps for profilers.

Maps are also the performance story, not just the plumbing. The naive design — ship every event to userspace and aggregate there — drowns in its own copying on a hot path that fires a million times a second. The eBPF idiom inverts it: aggregate in the kernel, export summaries. A latency tool increments histogram buckets in a map on every I/O and userspace reads the finished histogram once a second; a million events cross the boundary as a few dozen numbers. Per-CPU maps push it further by giving each core its own copy so hot-path updates never contend on a shared cache line. When you do need individual events (a process exec, a new connection), the ring buffer streams them — but well-built tools send the rare interesting event, not the torrent.
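The aggregate-then-export idiom is easy to model in userspace. The sketch below is plain Python standing in for a per-CPU array map; the names (`percpu_hist`, `on_io_complete`, `read_summary`), the core count, and the random latencies are assumptions made for illustration, and the event count is scaled down so it runs quickly. It shows the data layout only, since single-threaded Python cannot demonstrate cache-line contention.

```python
import random

NUM_CPUS = 4       # assumed core count for the model
NUM_BUCKETS = 32   # power-of-two buckets: bucket i holds values in [2**i, 2**(i+1))

# Stand-in for a per-CPU array map: one private row of counters per core.
percpu_hist = [[0] * NUM_BUCKETS for _ in range(NUM_CPUS)]


def on_io_complete(cpu, latency_ns):
    """'Kernel side': runs on every event and touches only this CPU's row."""
    bucket = min(max(latency_ns, 1).bit_length() - 1, NUM_BUCKETS - 1)
    percpu_hist[cpu][bucket] += 1


def read_summary():
    """'Userspace side': runs once per interval and sums the per-CPU rows."""
    return [sum(row[b] for row in percpu_hist) for b in range(NUM_BUCKETS)]


random.seed(0)
EVENTS = 100_000
for _ in range(EVENTS):
    on_io_complete(random.randrange(NUM_CPUS), random.randint(1_000, 5_000_000))

summary = read_summary()
print(f"{EVENTS} events crossed the boundary as {len(summary)} counters")
```

However many events fire, the reader only ever copies `NUM_BUCKETS` numbers per interval, which is the whole trick.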

Attach points: kprobes, uprobes, and tracepoints

Where you can attach determines what you can see. Three families cover most observability work:

| Hook | Attaches to | Stability | Typical use |
| --- | --- | --- | --- |
| kprobe / kretprobe | (Almost) any kernel function's entry / return | Unstable (follows kernel internals) | Ad-hoc digging: file opens, TCP retransmits, lock waits |
| tracepoint | Static, maintained markers in kernel code | Stable API across kernel versions | Syscalls, scheduler events, block I/O (the durable stuff) |
| uprobe / USDT | Userspace function entry/return, or static app markers | Tracks the binary's symbols | Tracing libssl, malloc, language runtimes (no code change) |

The stability column is the one to internalise. A kprobe can hook nearly any function in the running kernel, which is enormous power with no contract: the function you hooked may be renamed, inlined, or restructured in the next kernel release, and your tool silently breaks. Tracepoints are the opposite trade — a curated set of markers kernel developers commit to keeping stable, so production tooling prefers them wherever one exists, falling back to kprobes for everything else. Uprobes extend the same trick into userspace binaries — hook SSL_write in OpenSSL and you read HTTPS payloads before encryption, hook a function in your own binary and you trace it live — at a noticeably higher per-event cost than kernel-side hooks, since each hit traps into the kernel. Beyond these three sit the perf-event hooks that drive sampling profilers and the networking hooks (XDP, tc, socket and cgroup programs) that let eBPF process packets — the foundation Cilium is built on.
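The "tracepoint where one exists, kprobe otherwise" rule is simple enough to write down. This is a hedged sketch: `choose_hook` and `read_names` are hypothetical helpers, and the sample names are hard-coded so the example runs anywhere. On a real host the two sets would typically come from the tracefs files `available_events` and `available_filter_functions`, which usually need root to read.

```python
def read_names(path):
    """Read one name per line from a tracefs listing (first token of each line)."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}


def choose_hook(tracepoint, kernel_function, available_events, available_functions):
    """Prefer a stable tracepoint, fall back to a kprobe, else report no hook."""
    if tracepoint in available_events:
        return f"tracepoint:{tracepoint}"
    if kernel_function in available_functions:
        return f"kprobe:{kernel_function}"
    return None


# Sample data standing in for what a particular kernel exposes.
events = {"syscalls:sys_enter_openat", "tcp:tcp_retransmit_skb", "sched:sched_switch"}
functions = {"vfs_read", "tcp_retransmit_skb"}

print(choose_hook("tcp:tcp_retransmit_skb", "tcp_retransmit_skb", events, functions))
print(choose_hook(None, "vfs_read", events, functions))
```

A tool built this way degrades loudly (no hook found) rather than silently when a kernel upgrade renames or inlines the function it used to probe.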

CO-RE: compile once, run everywhere

eBPF programs read kernel data structures, and kernel data structures change between versions — a field moves, a struct grows, an offset shifts. The first-generation answer (the bcc toolchain) was to ship LLVM and kernel headers to every host and compile the eBPF program on the box at load time, against that machine's exact kernel. It worked, at the cost of a hundreds-of-megabytes runtime dependency, slow startup, and a compiler running on production hosts.

CO-RE (compile once, run everywhere) is the modern fix. Kernels now ship BTF — type information describing their own structs — and CO-RE programs are compiled once with relocatable field accesses. At load time, libbpf reads the running kernel's BTF and patches each access to the right offset for this kernel. The result is a small, self-contained binary that runs across kernel versions with no compiler and no headers on the host. This sounds like packaging minutiae, but it is the change that made eBPF shippable as a product rather than a toolkit for kernel-adjacent specialists — every serious agent (Cilium, Pixie, Parca, Datadog's and Grafana's eBPF collectors) is CO-RE-based now.
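A small model makes the relocation step concrete. Everything here is illustrative: the two struct layouts are invented, the dictionaries only stand in for BTF, and `load` and `run` are hypothetical names rather than libbpf calls. The idea it shows is the real one, though: the compiled program names the field it wants, and the loader turns that name into an offset for the kernel it finds itself on.

```python
import struct

# Invented layouts of the same struct on two kernel builds.
# Each maps field name -> (byte offset, struct format code); a stand-in for BTF.
BTF_KERNEL_A = {"task": {"pid": (0, "i"), "tgid": (4, "i")}}
BTF_KERNEL_B = {"task": {"flags": (0, "q"), "pid": (8, "i"), "tgid": (12, "i")}}

# "Compiled once": the program records which fields it reads, not their offsets.
PROGRAM = [("task", "pid"), ("task", "tgid")]


def load(program, btf):
    """Loader step: resolve each symbolic field access against this kernel's types."""
    relocated = []
    for struct_name, field in program:
        if field not in btf.get(struct_name, {}):
            raise LookupError(f"{struct_name}.{field} not present on this kernel")
        relocated.append(btf[struct_name][field])
    return relocated


def run(relocated, raw_bytes):
    """Execute the patched accesses against raw struct memory."""
    return [struct.unpack_from("<" + fmt, raw_bytes, off)[0] for off, fmt in relocated]


# The same logical values laid out differently on each kernel.
mem_a = struct.pack("<ii", 4242, 4200)
mem_b = struct.pack("<qii", 0x400, 4242, 4200)

print(run(load(PROGRAM, BTF_KERNEL_A), mem_a))
print(run(load(PROGRAM, BTF_KERNEL_B), mem_b))
```

One `PROGRAM`, two kernels, identical results. A field that does not exist on the running kernel fails at load time, not at some later moment with a garbage read.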

The tooling landscape

The ecosystem stacks in rough layers, from one-liners to platforms:

bcc is the original toolkit — a library plus dozens of ready-made tools (execsnoop, biolatency, tcpretrans, opensnoop) that are still the fastest way to answer "what is this box doing right now." bpftrace is the one-liner language on top of the same machinery — awk for kernel events. One line at a shell prompt gets you a live histogram of read latencies:

bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
  kretprobe:vfs_read /@start[tid]/
  { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
bpftrace: timestamp on entry, histogram of elapsed time on return. The kernel keeps the histogram; you just read it.
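If the one-liner reads as line noise, here is the same logic as a userspace model in Python, fed a hand-written event list. The function name and the event tuples are assumptions for illustration; the dictionaries play the roles of the `@start` and `@ns` maps, and the membership test approximates the `/@start[tid]/` filter.

```python
def latency_histogram(events):
    """events: iterable of (kind, tid, nsecs), where kind is "entry" or "return"."""
    start = {}  # plays the role of @start[tid]
    hist = {}   # plays the role of @ns: bucket lower bound -> count
    for kind, tid, nsecs in events:
        if kind == "entry":
            start[tid] = nsecs
        elif kind == "return" and tid in start:  # like the /@start[tid]/ filter
            elapsed = nsecs - start.pop(tid)     # like delete(@start[tid])
            lower = 1 << (max(elapsed, 1).bit_length() - 1)  # power-of-two bucket
            hist[lower] = hist.get(lower, 0) + 1
    return hist


events = [
    ("return", 7, 50),    # return with no recorded entry: skipped by the filter
    ("entry", 1, 100),
    ("entry", 2, 110),
    ("return", 1, 400),   # 300 ns elapsed
    ("return", 2, 1200),  # 1090 ns elapsed
]
print(latency_histogram(events))
```

The filter matters because tracing usually starts while calls are already in flight; without it, a return whose entry was never seen would subtract from a missing timestamp and record a nonsense latency.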

Cilium is eBPF as the network layer itself — a Kubernetes CNI that implements routing, load balancing, and network policy in eBPF instead of iptables — and Hubble is the observability surface on top: per-flow visibility into which pod talked to which service, with what protocol and verdict, with no sidecars and no application change. Pixie aims at application observability: its uprobes on TLS libraries and protocol parsers on socket data reconstruct HTTP, gRPC, and database requests for every pod, instantly, with no SDK. Parca (and the eBPF agents of Pyroscope and others) does continuous profiling — sampling stacks across the whole fleet at ~100 Hz, all day, cheap enough to leave on, so "why is this service burning 40% more CPU than last Tuesday" becomes a diff between two flame graphs rather than an archaeology project. The same probe machinery also powers the security crowd — Falco and Cilium Tetragon watch syscalls for runtime threat detection — which tells you how general the substrate is.

Zero-instrumentation observability — and its honest limits

The pitch writes itself: deploy one agent, get HTTP latencies, network flows, CPU profiles, and syscall activity for every workload on the node — including the third-party binary you cannot recompile and the legacy service nobody dares touch. No SDK, no redeploy, no per-language instrumentation effort, and one agent per node rather than code in every process. For infrastructure-shaped questions, this is genuinely as good as it sounds.

But the ceiling is structural, not a missing feature. The kernel sees events, not intent. Specifically:

No business context. eBPF can see a 900 ms POST to /checkout; it cannot know the cart value, the customer tier, or the feature flag that selected the slow code path. Those attributes exist only inside your application, and only explicit instrumentation can attach them.

Trace causality is guesswork. Distributed tracing works because instrumented services propagate context into outgoing requests. An eBPF agent watching sockets sees that a request came in and three calls went out; it can correlate them with timing and thread heuristics, but it cannot prove which inbound request caused which outbound call the way a propagated trace id can.

Encryption gets in the way. On the wire, TLS hides payloads; tools recover them with uprobes on the TLS library, which works until the runtime statically links, ships its own crypto, or changes symbols.

In-process blindness. Function-level visibility inside JIT-compiled and interpreted runtimes (JVM, Python, Node) is far weaker than in compiled binaries: the kernel sees the interpreter, not your function names, unless extra symbol machinery fills the gap.
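The causality limit is worth seeing in miniature. The sketch below is an assumption-laden model, not how any particular agent works: `attribute_by_timing` is a hypothetical heuristic that blames an outbound call on whichever inbound request was active on the same thread at that moment, and `attribute_by_trace_id` stands in for propagated context.

```python
def attribute_by_timing(inbound, outbound):
    """inbound: list of (request_id, thread, start, end)
    outbound: list of (call_id, thread, timestamp)
    Returns call_id -> candidate request_ids (same thread, call inside the window)."""
    result = {}
    for call_id, thread, ts in outbound:
        result[call_id] = [
            rid for rid, th, start, end in inbound
            if th == thread and start <= ts <= end
        ]
    return result


def attribute_by_trace_id(outbound_with_context):
    """With propagated context, each call simply names its parent."""
    return {call_id: [parent] for call_id, parent in outbound_with_context}


# Thread-per-request server: one request per thread, so the heuristic holds up.
pool = attribute_by_timing(
    [("req-A", "t1", 0, 100), ("req-B", "t2", 10, 90)],
    [("call-1", "t1", 20), ("call-2", "t2", 30)],
)

# Event-loop server: both requests are in flight on the same thread.
loop = attribute_by_timing(
    [("req-A", "loop", 0, 100), ("req-B", "loop", 10, 90)],
    [("call-1", "loop", 20)],
)

print(pool)
print(loop)
print(attribute_by_trace_id([("call-1", "req-B")]))
```

In the event-loop case the heuristic returns two candidates and has no way to pick between them, while the propagated id answers the question outright. Async runtimes and shared worker pools make that ambiguous case the common one.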

So the mature position is complement, not replacement: eBPF gives you the floor — every process, every node, no code changes — and OpenTelemetry gives you the meaning — business attributes and real causal traces where you have invested in instrumentation. Teams that frame it as either/or end up missing one half of their incidents.

The overhead and safety story, in one place

Because "run my code in your kernel on every packet" is an alarming sentence, it is worth stating the full production argument compactly. Safety: the verifier proves termination and memory safety before load; helpers gate all kernel access; a buggy program is rejected, not crashed into. Overhead: events cost roughly tens to a few hundred nanoseconds of JIT-compiled work each, so cost is driven by event frequency — a sampling profiler at 100 Hz per core or a probe on process-exec is negligible; a probe on every scheduler switch or every packet on a saturated NIC is a measurable tax you should benchmark first. In-kernel aggregation keeps the boundary-crossing cost out of the hot path. Production deployments of the major agents typically land in the low single digits of percent CPU, which is competitive with — often cheaper than — the in-process instrumentation it partially replaces.
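The frequency argument is just multiplication, and it is worth doing once. The rates and the 200 ns per-event cost below are illustrative assumptions picked from within the range quoted above, not measurements, and `core_fraction` is a hypothetical helper.

```python
def core_fraction(events_per_sec, ns_per_event):
    """Fraction of one core spent in the probe: event rate times per-event cost."""
    return events_per_sec * ns_per_event / 1e9


# Illustrative inputs only; measure the real per-event cost on your own hosts.
scenarios = {
    "profiler, 100 Hz on one core": (100, 200),
    "exec probe, 50 execs/s": (50, 200),
    "per-packet probe, 1M packets/s": (1_000_000, 200),
}
for name, (rate, cost_ns) in scenarios.items():
    print(f"{name}: {core_fraction(rate, cost_ns):.4%} of a core")
```

At the same per-event cost, the 100 Hz profiler spends 0.002% of a core and the per-packet probe spends 20%. The probe did not get more expensive; the event did.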

The operational caveats are about privilege and fleet management rather than stability: loading observability probes requires elevated capabilities (CAP_BPF / CAP_PERFMON, or root), so the agent itself is a sensitive component; kprobe-based tools need testing against your kernel versions because internals move; and older kernels (4.x and early 5.x, still common in long-tail enterprise fleets) may lack the built-in BTF and ring-buffer support that modern CO-RE tooling assumes. The BPF ring buffer, for instance, arrived in mainline Linux 5.8.
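A preflight check for those caveats might look like the sketch below. Treat it as a rough heuristic under stated assumptions: the capability bit numbers are the Linux ones (CAP_SYS_ADMIN 21, CAP_PERFMON 38, CAP_BPF 39), but what a given agent actually needs depends on the program types it loads, and the function names are hypothetical.

```python
import os

CAP_SYS_ADMIN = 21
CAP_PERFMON = 38
CAP_BPF = 39


def has_cap(cap_eff_hex, cap):
    """Test one bit of the effective capability mask (hex, as in /proc/<pid>/status)."""
    return bool((int(cap_eff_hex, 16) >> cap) & 1)


def can_load_probes(cap_eff_hex):
    """Heuristic: CAP_SYS_ADMIN alone, or CAP_BPF and CAP_PERFMON together."""
    if has_cap(cap_eff_hex, CAP_SYS_ADMIN):
        return True
    return has_cap(cap_eff_hex, CAP_BPF) and has_cap(cap_eff_hex, CAP_PERFMON)


def read_cap_eff(status_path="/proc/self/status"):
    """Return the CapEff mask of the current process as a hex string (Linux only)."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return line.split()[1]
    raise RuntimeError("CapEff not found")


def kernel_has_btf():
    """Kernels built with BTF expose their type information at this path."""
    return os.path.exists("/sys/kernel/btf/vmlinux")


print(can_load_probes("000000c000000000"))  # only CAP_BPF and CAP_PERFMON set
print(can_load_probes("0000000000000000"))  # no capabilities
```

Running something like this at agent startup turns a confusing load failure into a clear message about which prerequisite is missing.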

