perf

top told you which process is burning CPU. The next question is harder and more useful: which function? Which loop, which lock, which memcpy is the processor actually executing, right now, instruction by instruction? That is the question perf answers, by sampling the machine thousands of times a second and recording what it catches. This page covers the five invocations worth knowing, reads a perf top and a perf stat output line by line, walks three production hunts, explains flame graphs and why stacks come out broken in containers, and ends with a drill that will not hurt anything.

The question it answers

Every tool earlier in this sequence works at the granularity of a process. top and htop tell you that PID 41327 is using 340% CPU. What's eating my CPU? walks the decision tree from a hot machine to a guilty process. But a process is a big, opaque thing. Saying "java is using the CPU" is like saying "the building is using the electricity" — true, unhelpful. The work happens in functions, and the question that actually leads to a fix is which function the processor spends its cycles in.

perf answers that question by sampling. Many times per second, on every CPU, it arranges for an interrupt that asks one tiny question: what instruction is executing right this instant, and in which function does that instruction live? Each answer is one sample. Collect a few thousand and the histogram of where the samples landed is, statistically, a map of where the CPU time goes. If 31% of samples land inside hash_lookup, the machine spends roughly 31% of its cycles there. No instrumentation, no recompiling, no restarting the service. You point it at the whole system or one PID and it tells you what the silicon is doing.
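The statistical claim can be checked without perf at all. The sketch below is a simulation, not a profiler: it assumes a hypothetical workload whose true time shares are known (the function names and percentages are invented for illustration), fires "interrupts" at random instants, and counts where each one lands.

```python
import random
from collections import Counter

# Hypothetical ground truth: share of CPU time spent in each function.
TRUE_SHARES = {"hash_lookup": 0.31, "parse_request": 0.24, "everything_else": 0.45}

def function_at(instant: float) -> str:
    """Return the function 'executing' at a point in [0, 1) of the timeline."""
    upper = 0.0
    for name, share in TRUE_SHARES.items():
        upper += share
        if instant < upper:
            return name
    return name  # guards against float rounding at the very end of the timeline

def sample_profile(n_samples: int, seed: int = 1) -> dict[str, float]:
    """Take n_samples random 'interrupts' and return the observed share per function."""
    rng = random.Random(seed)
    hits = Counter(function_at(rng.random()) for _ in range(n_samples))
    return {name: hits[name] / n_samples for name in TRUE_SHARES}

if __name__ == "__main__":
    for n in (100, 4000, 120_000):
        estimate = sample_profile(n)
        print(n, {name: f"{share:.1%}" for name, share in estimate.items()})
```

With a hundred samples the estimate wobbles by several points; with the 120,000 a thirty-second capture at 4000 Hz would collect on one busy CPU, it settles close to the true 31%. That is the sense in which a few thousand samples are "statistically a map".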

It does more than sample, which is both its strength and the reason its man pages feel endless. The same tool reads the CPU's hardware counters (how many instructions, how many cache misses, how many branch mispredictions), traces kernel events, and annotates assembly. You will use a small fraction of it. The fraction below covers most CPU investigations a working engineer runs, and the profiling page covers where this kind of measurement sits in the wider performance method.

The five invocations that matter

perf is a multiplexer: perf top, perf record, perf report, perf stat, and a few dozen more subcommands you can ignore for years. Five invocations cover the daily work.

Invocation	What it does	When you reach for it
`perf top`	Live, whole-system function histogram, updated like top	First look at a hot machine: "what is everyone doing?"
`perf top -p 41327`	The same live view, narrowed to one process	You already know the PID and want its hot functions now
`perf record -g -p 41327 -- sleep 30`	Records 30 seconds of samples with call stacks to perf.data	Anything you want to study, share, or turn into a flame graph
`perf report`	Interactive browser over a recorded perf.data	Reading what record captured, drilling into call chains
`perf stat -- cmd`	Counter totals for one run: cycles, instructions, IPC, cache misses	Before/after comparisons, "is this workload CPU-bound or memory-bound?"

Two details in the record line are worth spelling out. -g asks for call stacks with every sample, not just the instruction pointer — without it you learn that memcpy is hot but not who called it, and the answer to "who called it" is usually the whole investigation. The trailing -- sleep 30 is a timer trick: record runs until the command after -- exits, so attaching to a PID and running sleep 30 gives you a clean thirty-second window. You can also wrap a command directly, perf record -g -- ./myserver --bench, and record exactly that run.

The record invocation has a sibling that opens the rest of the tool: perf record -e selects which event drives the sampling. The default event is CPU cycles, which answers "where does time go." But you can sample on cache-misses to find the code that misses the cache, on branch-misses for misprediction hotspots, or on kernel tracepoints like block:block_rq_issue to catch who submits disk I/O. perf list prints the menu, which on a modern CPU runs to hundreds of entries. You will not need most of them, but knowing the door exists changes what kind of questions you think to ask.

The sampling frequency deserves a note. perf record samples around 4000 times per second per CPU by default; tools and guides often use -F 99 (99 Hz) for long captures, where the lower rate keeps overhead and file size down. The odd number is not superstition. Sampling at exactly 100 Hz risks running in lockstep with timers and periodic work that also fire at round frequencies, so every sample could catch the same phase of the same loop. 99 Hz drifts relative to everything periodic and samples fairly.
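The lockstep effect is easy to reproduce on paper. The sketch below assumes a hypothetical periodic task that fires every 10 ms and spends the first 3 ms of each period in a function called hot_loop (both figures are invented), then samples it at 100 Hz and at 99 Hz using exact fractions so that no floating-point drift muddies the result.

```python
from fractions import Fraction

PERIOD_US = 10_000   # hypothetical periodic task: fires every 10 ms (100 Hz)
BUSY_US = 3_000      # ...and spends the first 3 ms of each period in hot_loop

def observed_share(sample_hz: int, seconds: int = 1) -> Fraction:
    """Fraction of samples that land inside hot_loop at a given sampling rate."""
    interval_us = Fraction(1_000_000, sample_hz)   # exact sampling interval
    n_samples = sample_hz * seconds
    hits = sum(
        1 for k in range(n_samples)
        if (k * interval_us) % PERIOD_US < BUSY_US  # phase within the task's period
    )
    return Fraction(hits, n_samples)

if __name__ == "__main__":
    print("true share:", Fraction(BUSY_US, PERIOD_US))
    print("at 100 Hz :", observed_share(100))
    print("at  99 Hz :", observed_share(99))
```

At 100 Hz every sample lands at the same phase of the period, so the profile reports hot_loop at 100% (with a different starting offset it would report 0%). At 99 Hz the sample point slides through the period and the estimate comes out at 10/33, about 30%, which is the true share to within sampling error.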

Reading the output

Here is a perf top from a busy box running a Java service behind nginx. Run it as root; the pitfalls section covers what happens when you do not.

$ sudo perf top
Samples: 48K of event 'cycles:P', 4000 Hz, Event count (approx.): 31204418329
Overhead  Shared Object             Symbol
  14.21%  libjvm.so                 [.] SpinPause
   9.87%  perf-41327.map            [.] Lcom/acme/cache/ShardMap;::lookup
   7.45%  [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
   5.12%  libc.so.6                 [.] __memmove_avx_unaligned_erms
   3.96%  [kernel.kallsyms]         [k] _raw_spin_lock
   2.31%  nginx                     [.] ngx_http_parse_header_line
   1.88%  libjvm.so                 [.] 0x00000000007c41b3
   1.40%  [kernel.kallsyms]         [k] tcp_sendmsg_locked

Three columns. Overhead is the fraction of samples that landed in this symbol — not a measured duration, a share of the histogram. The percentages are relative to the samples collected, so 14.21% means "of everything every CPU was doing during this window, one seventh of the caught moments were inside SpinPause." Shared Object is which binary or library the instruction belonged to, and it carries more diagnostic weight than it looks like it should. Symbol is the function name, when perf can resolve one.

The shared object column is a sorting hat. [kernel.kallsyms] means the sample landed in kernel code — the [k] tag on the symbol says the same thing — so this CPU time will show up as system time, not user time, in top. A profile dominated by kernel symbols says the process is making the kernel work: copying data to and from user space (copy_user_enhanced_fast_string is the kernel's bulk copy, usually driven by read/write syscalls), contending on spinlocks, running the network stack. A file like perf-41327.map is a JIT map: runtimes that compile code at runtime — the JVM, V8, some Python JITs — can write a map file telling perf which generated code lives at which address, which is how a Java method name shows up in a profile of native instruction pointers. And a bare hex address like 0x00000000007c41b3 where a name should be means perf found code it cannot name: no symbols for that region. One stray row of hex is cosmetic; a profile that is mostly hex is unusable, and the pitfalls section covers the fixes.
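The sorting-hat reading can be mechanised. The sketch below takes the rows from the screen above as plain text and totals the overhead per bucket; the bucket names and the parsing rules are this example's own assumptions, and real perf output varies between versions, so treat it as an illustration of the reading rather than a robust parser.

```python
from collections import defaultdict

# Rows copied from the perf top screen above (header lines omitted).
PERF_TOP_ROWS = """\
  14.21%  libjvm.so                 [.] SpinPause
   9.87%  perf-41327.map            [.] Lcom/acme/cache/ShardMap;::lookup
   7.45%  [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
   5.12%  libc.so.6                 [.] __memmove_avx_unaligned_erms
   3.96%  [kernel.kallsyms]         [k] _raw_spin_lock
   2.31%  nginx                     [.] ngx_http_parse_header_line
   1.88%  libjvm.so                 [.] 0x00000000007c41b3
   1.40%  [kernel.kallsyms]         [k] tcp_sendmsg_locked
"""

def classify(shared_object: str, tag: str, symbol: str) -> str:
    """Bucket one row the way the paragraph above reads it."""
    if tag == "[k]":
        return "kernel"
    if symbol.startswith("0x"):
        return "unresolved"          # perf found code it could not name
    if shared_object.startswith("perf-") and shared_object.endswith(".map"):
        return "jit"                 # name came from a runtime-written map file
    return "user"

def overhead_by_bucket(text: str) -> dict[str, float]:
    """Sum the Overhead column per bucket, in percent."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        if not line.strip():
            continue
        overhead, shared_object, tag, symbol = line.split(None, 3)
        totals[classify(shared_object, tag, symbol)] += float(overhead.rstrip("%"))
    return {bucket: round(total, 2) for bucket, total in totals.items()}

if __name__ == "__main__":
    print(overhead_by_bucket(PERF_TOP_ROWS))
```

For the eight visible rows this gives roughly 21.6% native user code, 12.8% kernel, 9.9% JIT-compiled Java and 1.9% unresolved. The rest of the samples are in rows below the fold.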

Now the counter view. perf stat does no sampling at all; it programs the CPU's hardware counters at the start of a command, reads them at the end, and prints totals.

$ perf stat -- ./report-builder --month 2026-05

 Performance counter stats for './report-builder --month 2026-05':

          4,182.41 msec task-clock                #    0.992 CPUs utilized
    14,208,419,206      cycles                    #    3.397 GHz
     6,391,532,118      instructions              #    0.45  insn per cycle
     1,242,816,004      branches                  #  297.152 M/sec
        18,420,711      branch-misses             #    1.48% of all branches
       412,396,221      cache-references
       198,123,460      cache-misses              #   48.04% of all cache refs

       4.215042512 seconds time elapsed

The line to read first is insn per cycle, IPC. A modern out-of-order core can retire four or more instructions every cycle when the pipeline is fed; this run managed 0.45. That gap is the story. Low IPC means the core spent most of its cycles stalled, waiting for something, and the cache-misses line at the bottom of the block names the something: nearly half of all cache references missed, so the core sat idle while loads crawled out to main memory. This program is not compute-bound, it is memory-bound, and making its arithmetic faster will do nothing. Conversely, an IPC of 2 or 3 with low miss rates says the core is genuinely busy executing, and the only way to go faster is to execute less. One number, and it redirects the whole optimisation effort. The systematic version of this reading (start at cycles, ask whether they retired work or stalled, then ask why) is the top-down method, and the machinery being measured, the fetch-decode-execute pipeline these counters watch, is the subject of the instruction cycle.

Branch misses get the same treatment: 1.48% is healthy, while several percent on a hot path means the core keeps guessing wrong about which way the code goes and throwing away speculative work. The general rule for perf stat is that the absolute numbers mean little on their own; the ratios (IPC, miss percentages) and the deltas between two runs are where the information lives.
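None of the ratios after the # are measured separately; each is a division of two raw totals. A short sketch that recomputes them from the numbers in the block above makes that concrete. The dictionary keys and the function name are this example's own.

```python
# Raw totals copied from the perf stat block above.
counters = {
    "task_clock_msec": 4182.41,
    "cycles": 14_208_419_206,
    "instructions": 6_391_532_118,
    "branches": 1_242_816_004,
    "branch_misses": 18_420_711,
    "cache_references": 412_396_221,
    "cache_misses": 198_123_460,
}

def derived(c: dict) -> dict:
    """Recompute the ratios perf stat prints in its right-hand column."""
    cpu_seconds = c["task_clock_msec"] / 1000
    return {
        "ipc": c["instructions"] / c["cycles"],
        "ghz": c["cycles"] / cpu_seconds / 1e9,
        "branch_miss_pct": 100 * c["branch_misses"] / c["branches"],
        "cache_miss_pct": 100 * c["cache_misses"] / c["cache_references"],
    }

if __name__ == "__main__":
    for name, value in derived(counters).items():
        print(f"{name:16s} {value:.3f}")
```

The results round to 0.45 instructions per cycle, 3.397 GHz, 1.48% branch misses and 48.04% cache misses, matching the printed column. The same four divisions applied to two runs are what turn a pair of counter dumps into a comparison.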

Three production scenarios

The hot function in a busy service

Latency on a service crept up over a week of deploys and CPU per request has roughly doubled. Nothing in the metrics says why. This is the canonical record-then-report hunt:

$ sudo perf record -g -p 41327 -- sleep 30
[ perf record: Woken up 142 times to write data ]
[ perf record: Captured and wrote 38.412 MB perf.data (412086 samples) ]
$ sudo perf report --stdio | head -20
# Overhead  Command  Shared Object   Symbol
    38.41%   server   server          [.] validate_row
            |
            --- validate_row
                |--96.2%-- render_table
                |          build_response
                |          handle_request
                 --3.8%-- import_batch

With -g the report shows not just that validate_row eats 38% of the CPU but that 96% of the samples inside it were reached through render_table. That call chain is the diagnosis: someone made rendering re-validate every row, probably as a defensive check that was cheap on the test dataset. Without stacks you would know the what but not the from where, and the from-where is what tells you which code to change. For profiles bigger than a screen, turn the same data into a flame graph (covered in the flame graph section below), where this exact pattern shows up as one wide tower you can see across the room.

CPU is high but my process looks idle-ish

The host graph says 80% CPU. top says your service is using 35% and nothing else is using much, and the columns that explain the difference are sy, hi and si in the header: system time, hardware interrupts, softirqs. Syscall time is charged to the process that made the call, but interrupt and softirq work is typically not charged to your service, so the per-process list lies by omission. perf top does not, because it samples whatever the CPU is doing, including the kernel acting on nobody's behalf in particular:

$ sudo perf top
Overhead  Shared Object        Symbol
  11.62%  [kernel.kallsyms]    [k] nft_do_chain
   9.84%  [kernel.kallsyms]    [k] __netif_receive_skb_core
   7.10%  [kernel.kallsyms]    [k] tcp_v4_rcv
   5.93%  [kernel.kallsyms]    [k] _raw_spin_lock
   4.41%  [kernel.kallsyms]    [k] csum_partial

Everything hot is in the kernel and everything hot is the network receive path: nft_do_chain is nftables firewall evaluation, the rest is packet processing and checksumming inside softirq context. The machine is spending its CPU shovelling packets through a fat ruleset before your process ever sees a byte, which is why no process owns the time. The fix lives in firewall rule structure or NIC offload settings, not in your service — a conclusion you could not reach from top at any zoom level. This is the move to remember: when system-level CPU and the per-process accounting disagree, perf top is the arbiter, and the broader triage tree is in what's eating my CPU?

Regression hunting with perf stat

A batch job got 40% slower after a refactor that "should not have changed anything." Wall-clock time tells you it is slower; perf stat tells you what kind of slower. Run both versions on the same input, same machine, and diff the counters:

$ perf stat -r 5 -- ./builder-v1 fixtures/may.db
     6,021,394,118      instructions      #    2.41  insn per cycle
        41,202,116      cache-misses      #    9.8% of all cache refs
       1.044 seconds time elapsed ( +- 0.31% )
$ perf stat -r 5 -- ./builder-v2 fixtures/may.db
     6,180,242,330      instructions      #    0.97  insn per cycle
       302,418,209      cache-misses      #   44.1% of all cache refs
       1.471 seconds time elapsed ( +- 0.42% )

Read the diff before reading the code. Instruction count barely moved (the refactor executes almost the same work), but IPC fell from 2.41 to 0.97 and cache misses went up seven-fold. The new code is not doing more, it is doing the same things in a memory-hostile order. That fingerprint usually means a layout change: an array of structs became an array of pointers, a contiguous buffer became a linked structure, a hot loop now walks objects scattered across the heap. The -r 5 runs the command five times and prints the mean of each counter with its run-to-run deviation (the +- figure), which is what makes the comparison trustworthy; single runs of anything are noise. With the counters pointing at memory layout, the follow-up is perf record -e cache-misses -g on v2 to find exactly which loop misses.
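Reading the diff amounts to dividing each v2 figure by its v1 counterpart and seeing which ratios are far from 1. A minimal sketch using the numbers printed above (the dictionary layout is this example's own choice):

```python
# Figures copied from the two perf stat -r 5 runs above.
v1 = {"instructions": 6_021_394_118, "ipc": 2.41, "cache_misses": 41_202_116, "seconds": 1.044}
v2 = {"instructions": 6_180_242_330, "ipc": 0.97, "cache_misses": 302_418_209, "seconds": 1.471}

def ratios(before: dict, after: dict) -> dict:
    """after / before for every counter; 1.0 means unchanged."""
    return {key: after[key] / before[key] for key in before}

if __name__ == "__main__":
    for key, ratio in ratios(v1, v2).items():
        print(f"{key:14s} x{ratio:.2f}")
```

Instructions come out about 3% higher, wall-clock time about 41% higher, IPC at roughly 0.4 of its old value and cache misses at about 7.3 times. The counter that moved most is the one to chase.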

Sampling versus tracing, and how to read a flame graph

It is worth being honest about what a sampling profiler can and cannot see, because the failure mode is silent. A sampler looks at the machine N times a second. Anything that holds the CPU for a meaningful share of the window shows up in proportion to the share it holds. Anything brief and rare slips between samples: a function that runs for 200 microseconds once a second will essentially never be caught at 99 Hz, even if that 200 microseconds is your entire latency problem. Sampling answers "where does the CPU time go in aggregate." It does not answer "what happened during this one slow request," and it does not see off-CPU time at all — a thread blocked on a lock or a disk read is invisible to a CPU sampler precisely because it is not on the CPU.
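The arithmetic behind "essentially never" is worth doing once. The sketch below estimates how many samples a brief event should collect and, treating sample instants as independent of the event (a Poisson approximation, which is an assumption of this example and not something perf guarantees), how likely a capture is to contain none at all.

```python
import math

def expected_hits(sample_hz: float, busy_seconds_per_second: float, capture_seconds: float) -> float:
    """Average number of samples that land inside the event during the capture."""
    return sample_hz * busy_seconds_per_second * capture_seconds

def chance_of_no_hits(expected: float) -> float:
    """Poisson approximation: probability the capture contains zero such samples."""
    return math.exp(-expected)

if __name__ == "__main__":
    # The example from the text: 200 microseconds of work once a second, sampled at 99 Hz.
    hits = expected_hits(sample_hz=99, busy_seconds_per_second=200e-6, capture_seconds=30)
    print(f"expected samples in a 30 s capture: {hits:.3f}")
    print(f"chance of catching nothing at all : {chance_of_no_hits(hits):.0%}")
```

A thirty-second capture expects about 0.6 samples inside the function, and the odds are better than even that it records zero. Raising the frequency or lengthening the capture moves the number, but a per-event question is still better served by tracing.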

Tracing is the opposite trade. strace records every system call a process makes, so nothing slips through, but the recording mechanism stops the process at every call boundary, and on a syscall-heavy workload the slowdown can reach ten times or more. perf's sampling overhead is typically low single-digit percent, gentle enough to run against production. So the division of labour: sampling for "where does the time go," run freely; tracing for "show me each event," run with care and ideally not on the hot path. When the question is per-event and the overhead of strace is unacceptable, perf's own tracepoint events split the difference.

Figure: sampling in one picture. The interrupts do not measure durations; they catch moments, and the proportions of the catches estimate the proportions of the time. The dashed sliver in the figure is the kind of brief event sampling never sees.

Flame graphs are what call-stack samples become when you stack them up. perf report shows the same data as an expandable tree, which works until the profile has hundreds of distinct stacks; the flame graph shows all of them at once. The classic pipeline is Brendan Gregg's scripts: perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg, and newer perf versions can emit one directly. Reading one takes a single rule and one warning.

Figure: a flame graph of the first scenario. Each row is a stack depth; each box is a function; a box sits on the box that called it. validate_row's 38% rides almost entirely on render_table, which is the call chain from the first scenario, drawn instead of printed.

The rule: width is inclusive time. A box's width is the fraction of samples whose stack contained that function, whether it was executing or merely an ancestor of what was. Wide box at the bottom: everything above it accounts to it. Wide flat-topped box with nothing above: samples landed in that function's own code, and that plateau is your hotspot. The warning: the x-axis is not time. Boxes are sorted alphabetically to make merging deterministic, so left-of does not mean before. People who read flame graphs as timelines invent causality that is not there. Width means amount; position means nothing.
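Both the merging and the width rule fit in a few lines. The sketch below works on a handful of invented stack samples loosely modelled on the first scenario (the counts are illustrative, not taken from the recording). It collapses them into semicolon-joined "folded" lines of the kind flamegraph.pl consumes, then computes each function's inclusive width and its own flat-top share.

```python
from collections import Counter

# Hypothetical call-stack samples, root first, as (stack, count) pairs.
SAMPLES = [
    (("handle_request", "build_response", "render_table", "validate_row"), 37),
    (("import_batch", "validate_row"), 1),
    (("handle_request", "build_response", "render_table"), 12),
    (("handle_request", "parse_headers"), 50),
]

def fold(samples):
    """Merge identical stacks into sorted 'a;b;c count' lines."""
    merged = Counter()
    for stack, count in samples:
        merged[";".join(stack)] += count
    return [f"{stack} {count}" for stack, count in sorted(merged.items())]

def widths(samples):
    """Return (inclusive, self) sample counts per function."""
    inclusive, self_time = Counter(), Counter()
    for stack, count in samples:
        for function in set(stack):      # count a recursive function once per sample
            inclusive[function] += count
        self_time[stack[-1]] += count    # only the leaf frame was actually executing
    return inclusive, self_time

if __name__ == "__main__":
    print("\n".join(fold(SAMPLES)))
    inclusive, self_time = widths(SAMPLES)
    for function, count in inclusive.most_common():
        print(f"{function:16s} inclusive={count:3d} self={self_time[function]:3d}")
```

handle_request comes out 99 samples wide with no samples of its own, which is the wide box at the bottom; validate_row is 38 wide and all 38 are self time, which is the plateau. Note that fold sorts its lines alphabetically, so nothing about their order says what ran first.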

How it works underneath

The machinery is worth ten minutes because every confusing perf behaviour traces back to it. Modern CPUs ship a Performance Monitoring Unit, a set of hardware counters that can be programmed to count events (cycles elapsed, instructions retired, cache misses at each level, branch mispredictions) at essentially zero cost to the running code. The counters can also be set to overflow: count down from N, and raise an interrupt when you hit zero. Set N to "cycles per sample at 4000 Hz" and the overflow interrupt becomes the sampling tick: the handler runs, records the instruction pointer (and with -g, walks the call stack), resets the counter, and returns. That is the whole trick. perf stat uses the same counters in plain counting mode, no interrupts, which is why its overhead is close to nothing.

The kernel exposes all of this through one syscall, perf_event_open(2), which hands back a file descriptor per event per CPU; the perf tool is a userspace client that opens these descriptors, mmaps ring buffers for the sample stream, and formats what arrives. This design is why other tools can do what perf does — profilers and eBPF tooling sit on the same syscall — and why permissions work the way they do: access is gated by perf_event_paranoid and capabilities rather than by anything perf-specific.

Samples arrive as raw addresses, and the gap between an address and a name is where profiles go to die. For kernel addresses, perf reads the kernel's own symbol table via /proc/kallsyms, which is why kernel rows look clean on a stock system. For user code, it maps the address through the process's loaded binaries and reads their symbol tables — which works exactly as well as the binaries' symbols allow. Stripped binary: hex addresses. Distro package without its debuginfo counterpart: hex addresses or bare offsets. JIT-compiled code: addresses that belong to no binary at all, unless the runtime writes a /tmp/perf-PID.map file mapping generated code to names — the JVM needs -XX:+PreserveFramePointer plus an agent such as perf-map-agent (or async-profiler, which speaks perf's formats), Node needs --perf-basic-prof.
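The map file is simple enough to resolve by hand. Each line is a start address, a size and a name, with both numbers in hex. The sketch below loads such a file and turns a sampled address into a name plus offset, falling back to bare hex when no entry covers the address. The map contents and addresses are invented for illustration, and this is a model of the lookup, not perf's actual code.

```python
import bisect

# Hypothetical contents of a /tmp/perf-PID.map file: "START SIZE NAME", hex without 0x.
MAP_TEXT = """\
7f3a91c40000 1c0 Lcom/acme/cache/ShardMap;::lookup
7f3a91c401c0 80 Lcom/acme/cache/ShardMap;::hash
7f3a91c41b00 200 Interpreter
"""

def load_map(text):
    """Parse map lines into a list of (start, size, name) sorted by start address."""
    entries = []
    for line in text.splitlines():
        start, size, name = line.split(None, 2)   # names may themselves contain spaces
        entries.append((int(start, 16), int(size, 16), name))
    entries.sort()
    return entries

def resolve(entries, address):
    """Return 'name+0xoffset' for a covered address, else the raw hex address."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, address) - 1  # last entry starting at or before address
    if i >= 0:
        start, size, name = entries[i]
        if address < start + size:
            return f"{name}+0x{address - start:x}"
    return f"0x{address:016x}"

if __name__ == "__main__":
    entries = load_map(MAP_TEXT)
    for address in (0x7F3A91C401D0, 0x7F3A91C41B80, 0x7F3A91C40400):
        print(resolve(entries, address))
```

The first two addresses resolve to names; the third falls in a gap between entries and prints as hex, which is exactly what a profile row looks like when the runtime never wrote the map.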

Call stacks have their own failure mode. The cheap way to walk a stack is to follow frame pointers: each function saves a register pointing at its caller's frame, and the unwinder just chases the chain. But compilers treat that register as a free general-purpose register when told to, and -fomit-frame-pointer has been a common default at -O2 for years — so the chain is broken before perf arrives, and stacks come out one or two frames deep ending in nonsense. The fixes, in order of preference: build with -fno-omit-frame-pointer (several distros have moved their entire package sets to this, precisely so profilers work); or record with --call-graph dwarf, which copies a chunk of raw stack into every sample and unwinds it later using DWARF debug info — accurate, but the data files balloon and deep stacks get truncated; or --call-graph lbr on Intel hardware, which has the CPU itself record the last few branches, cheap but short. Containers stack a second problem on top: perf resolves symbols through paths like /proc/PID/root/..., and a profiler running on the host must find binaries that live inside the container's mount namespace — modern perf handles this, older builds quietly print hex. When a container profile looks like garbage, suspect the namespace before the workload.
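A toy model shows why the frame-pointer walk is cheap and why it fails so abruptly. The sketch below assumes an x86-64 style layout in which the caller's saved frame pointer sits at the address the frame pointer holds, with the return address eight bytes above it. Memory is a plain dictionary, every address and name is made up, and return addresses are matched to names exactly where a real unwinder would do a range lookup.

```python
# Toy symbol table: return address -> function it returns into (all hypothetical).
SYMBOLS = {0x401100: "render_table", 0x401180: "build_response", 0x401200: "handle_request"}

# Toy stack memory: at fp the saved caller frame pointer, at fp + 8 the return address.
MEMORY = {
    0x7FFC0300: 0x7FFC0200, 0x7FFC0308: 0x401100,  # validate_row's frame
    0x7FFC0200: 0x7FFC0100, 0x7FFC0208: 0x401180,  # render_table's frame
    0x7FFC0100: 0x0,        0x7FFC0108: 0x401200,  # build_response's frame; 0 ends the chain
}

def walk_frame_pointers(memory, frame_pointer, max_depth=64):
    """Follow saved frame pointers, collecting one caller per frame."""
    callers = []
    while frame_pointer != 0 and len(callers) < max_depth:
        if frame_pointer not in memory or frame_pointer + 8 not in memory:
            break  # the register did not hold a frame address: the chain is broken
        callers.append(memory[frame_pointer + 8])
        frame_pointer = memory[frame_pointer]
    return [SYMBOLS.get(address, hex(address)) for address in callers]

if __name__ == "__main__":
    # Frame pointers kept: the register points at validate_row's frame.
    print(walk_frame_pointers(MEMORY, 0x7FFC0300))
    # Frame pointers omitted: the register holds some unrelated value.
    print(walk_frame_pointers(MEMORY, 0x2A))
```

With an intact chain the walk returns render_table, build_response, handle_request in three dictionary reads. With the register reused for something else it returns nothing, and a real unwinder in that position either stops or follows whatever the value happens to point at.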

Pitfalls

Profiling without symbols and trusting the hex. A profile where the hot rows are 0x00007f3a91c41b80 is not a profile, it is a shrug. Before recording anything you intend to act on, check the trifecta: debug symbols installed for your binary and its hot libraries, frame pointers present if you want stacks (or budget for DWARF unwinding), and a JIT map if a managed runtime is involved. Five minutes of setup against an afternoon of squinting at addresses.

Running it unprivileged and misreading the result. The kernel.perf_event_paranoid sysctl gates what non-root users may observe; at common defaults an unprivileged perf top sees only your own processes and no kernel samples, and at paranoid settings it refuses outright. Separately, kernel.kptr_restrict hides kernel symbol addresses, so even when kernel samples arrive they show as raw hex or a single unresolvable blob. The trap is the quiet version: a profile that silently omits all kernel time looks complete and is wrong. For anything incident-shaped, run perf as root and the question evaporates.

perf inside containers. Three distinct failures compound here. The container usually lacks the privileges to open perf events (CAP_PERFMON, or CAP_SYS_ADMIN on older kernels, and the default seccomp profiles of container runtimes block perf_event_open). The perf binary inside the image may not match the host kernel version. And symbol resolution crosses mount namespaces, as above. The pragmatic pattern is to profile from the host: host perf can see container processes as ordinary PIDs, and one privileged toolbox on the node beats fighting three problems inside every image.

Sampling bias on short-lived work. A sampler is fair to whatever runs long enough to be caught, and structurally blind to what does not. A fleet of worker processes that each live 80 milliseconds will barely register in a per-PID 99 Hz profile even if they collectively own the machine, because each one exists for about eight ticks and is gone before you can attach to it. For workloads like that, raise the frequency, profile system-wide rather than per-PID so samples land in whichever incarnation is alive, or switch to an event-driven view (fork/exec tracepoints, or simply perf stat around the spawning parent). When a profile and the CPU graph disagree, believe the graph and ask what the profiler cannot see.

Forgetting that perf.data is a file with opinions. perf record writes perf.data into the current directory and perf report reads the same path, which is convenient until you record twice and report on the wrong run, or copy the file to another machine and discover symbol resolution wanted the original binaries (perf archive exists for exactly that). Name your captures (-o hot-after-deploy.data) and report with -i.

A drill you can run right now

Everything below is safe on any Linux machine or VM: it reads counters, watches the system for ten seconds, and profiles a throwaway dd that copies zeroes to nowhere. You will need perf installed (linux-tools-common plus the kernel-matched linux-tools package on Ubuntu, linux-perf on Debian, perf on Fedora/Arch) and sudo for the second and third steps.

Step 1 — counters on something tiny. Run perf stat -- ls. The point is not the listing, it is seeing the counter block on a command so small the numbers are legible. Find the IPC line and read it as a verdict on the run. Run it again on perf stat -- ls -R /usr/share and watch the same counters describe a heavier job; compare the IPC and the cache-miss percentage between the two and notice you can already tell which one spends more of its life waiting on memory.

Step 2 — ten seconds of the live view. Run sudo perf top and just watch for ten seconds. On an idle machine the list is sparse and dominated by kernel housekeeping; that is its own lesson, because you now know what baseline looks like. Open a browser tab or run anything noisy in another terminal and watch the symbols rearrange. Find one [k] row and one [.] row and say out loud what the difference is. Press q to leave.

Step 3 — record, then report. Profile a deliberately CPU-flavoured job and read the result:

$ sudo perf record -g -- dd if=/dev/zero of=/dev/null count=2000000
2000000+0 records in
2000000+0 records out
1024000000 bytes (1.0 GB) copied, 1.09 s, 943 MB/s
[ perf record: Captured and wrote 0.482 MB perf.data (4127 samples) ]
$ sudo perf report --stdio | head -12
# Overhead  Command  Shared Object       Symbol
    24.30%  dd       [kernel.kallsyms]   [k] __clear_user
    11.84%  dd       [kernel.kallsyms]   [k] entry_SYSCALL_64
     8.92%  dd       [kernel.kallsyms]   [k] syscall_return_via_sysret
     6.45%  dd       [kernel.kallsyms]   [k] vfs_read
     5.07%  dd       libc.so.6           [.] read

Read what the profile says about a program you thought you understood. dd is "just copying," yet almost every hot symbol is kernel-side: __clear_user is the kernel manufacturing the zeroes that /dev/zero hands out, and the syscall-entry/exit symbols are the toll booth paid on each of the two million read/write pairs at the default 512-byte block size. Now run the same copy with bs=1M count=1000, record again, and watch the syscall overhead rows shrink: about the same number of bytes, one two-thousandth of the crossings. You have just used a profiler to find and fix a real inefficiency, on a one-line program, with the same three commands you would use on a production service.

If you remember one line. sudo perf top for "what is this machine doing right now," sudo perf record -g -p PID -- sleep 30 then sudo perf report for "where does this service's CPU go," and perf stat -r 5 -- cmd before and after any change you claim made something faster.
