perf
top told you which process is burning CPU. The next question is harder and more
useful: which function? Which loop, which lock, which memcpy is the processor actually
executing, right now, instruction by instruction? That is the question perf answers,
by sampling the machine thousands of times a second and recording what it catches. This page
covers the five invocations worth knowing, reads a perf top and a
perf stat output line by line, walks three production hunts, explains flame graphs
and why stacks come out broken in containers, and ends with a drill that will not hurt anything.
The question it answers
Every tool earlier in this sequence works at the granularity of a process. top and htop tell you that PID 41327 is using 340% CPU. What's eating my CPU? walks the decision tree from a hot machine to a guilty process. But a process is a big, opaque thing. Saying "java is using the CPU" is like saying "the building is using the electricity" — true, unhelpful. The work happens in functions, and the question that actually leads to a fix is which function the processor spends its cycles in.
perf answers that question by sampling. Many times per second, on every CPU, it
arranges for an interrupt that asks one tiny question: what instruction is executing right this
instant, and in which function does that instruction live? Each answer is one sample. Collect a
few thousand and the histogram of where the samples landed is, statistically, a map of where
the CPU time goes. If 31% of samples land inside hash_lookup, the machine spends
roughly 31% of its cycles there. No instrumentation, no recompiling, no restarting the service.
You point it at the whole system or one PID and it tells you what the silicon is doing.
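The histogram idea can be sketched without perf at all. The toy below assumes one sampled symbol name per input line; the helper name sample_histogram and the ten samples are made up for illustration. It counts how often each name was caught and prints its share, which is the same arithmetic that sits behind perf's Overhead column.

```shell
# Toy model of a sampling profile: one sampled symbol name per input line.
# sample_histogram is a hypothetical helper, not part of perf.
sample_histogram() {
  awk '
    { count[$1]++; total++ }
    END {
      for (sym in count)
        printf "%6.2f%%  %s\n", 100 * count[sym] / total, sym
    }
  ' | sort -rn
}

# Ten made-up samples: six caught hash_lookup, three parse_row, one log_line.
printf '%s\n' hash_lookup parse_row hash_lookup hash_lookup log_line \
              hash_lookup parse_row hash_lookup parse_row hash_lookup |
  sample_histogram
```

With only ten samples the percentages are coarse; a few thousand is what makes the shares trustworthy.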
It does more than sample, which is both its strength and the reason its man pages feel endless. The same tool reads the CPU's hardware counters (how many instructions, how many cache misses, how many branch mispredictions), traces kernel events, and annotates assembly. You will use a small fraction of it. The fraction below covers most CPU investigations a working engineer runs, and the profiling page covers where this kind of measurement sits in the wider performance method.
The five invocations that matter
perf is a multiplexer: perf top, perf record,
perf report, perf stat, and a few dozen more subcommands you can
ignore for years. Five invocations cover the daily work.
| Invocation | What it does | When you reach for it |
|---|---|---|
| perf top | Live, whole-system function histogram, updated like top | First look at a hot machine: "what is everyone doing?" |
| perf top -p 41327 | The same live view, narrowed to one process | You already know the PID and want its hot functions now |
| perf record -g -p 41327 -- sleep 30 | Records 30 seconds of samples with call stacks to perf.data | Anything you want to study, share, or turn into a flame graph |
| perf report | Interactive browser over a recorded perf.data | Reading what record captured, drilling into call chains |
| perf stat -- cmd | Counter totals for one run: cycles, instructions, IPC, cache misses | Before/after comparisons, "is this workload CPU-bound or memory-bound?" |
Two details in the record line are worth spelling out. -g asks for call stacks
with every sample, not just the instruction pointer — without it you learn that
memcpy is hot but not who called it, and the answer to "who called it" is usually
the whole investigation. The trailing -- sleep 30 is a timer trick: record runs
until the command after -- exits, so attaching to a PID and running
sleep 30 gives you a clean thirty-second window. You can also wrap a command
directly, perf record -g -- ./myserver --bench, and record exactly that run.
The record invocation has a sibling that opens the rest of the tool:
perf record -e selects which event drives the sampling. The default
event is CPU cycles, which answers "where does time go." But you can sample on
cache-misses to find the code that misses the cache, on
branch-misses for misprediction hotspots, or on kernel tracepoints like
block:block_rq_issue to catch who submits disk I/O. perf list prints
the menu, which on a modern CPU runs to hundreds of entries. You will not need most of them,
but knowing the door exists changes what kind of questions you think to ask.
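As a sketch of how the event choice changes the question, the wrapper below runs the same attach-and-wait recording with a different -e each time. The function name record_event and the PERF override (useful for a dry run with PERF=echo) are illustrative assumptions; the perf flags are the ones described above.

```shell
# Hypothetical wrapper: record call stacks for one PID, sampling on a chosen event.
# PERF can be overridden (PERF=echo prints the command instead of running it).
record_event() {
  event=$1 pid=$2 seconds=${3:-30}
  "${PERF:-perf}" record -e "$event" -g -p "$pid" -- sleep "$seconds"
}

# Dry run: print what would be executed for three different questions.
PERF=echo record_event cycles 41327          # where does time go?
PERF=echo record_event cache-misses 41327    # which code misses the cache?
PERF=echo record_event branch-misses 41327   # where are the mispredictions?
```

Drop the PERF=echo prefix (and add sudo as needed) to record for real; which event names are available depends on the CPU, so check perf list first.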
perf record samples around
4000 times per second per CPU by default; tools and guides often use -F 99 — 99 Hz
— for long captures. The odd number is not superstition. Sampling at exactly 100 Hz risks
running in lockstep with timers and periodic work that also fire at round frequencies, so every
sample would catch the same phase of the same loop. 99 Hz drifts relative to everything
periodic and samples fairly.
Reading the output
Here is a perf top from a busy box running a Java service behind nginx. Run it as
root; the pitfalls section covers what happens when you do not.
    $ sudo perf top
    Samples: 48K of event 'cycles:P', 4000 Hz, Event count (approx.): 31204418329
    Overhead  Shared Object      Symbol
      14.21%  libjvm.so          [.] SpinPause
       9.87%  perf-41327.map     [.] Lcom/acme/cache/ShardMap;::lookup
       7.45%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
       5.12%  libc.so.6          [.] __memmove_avx_unaligned_erms
       3.96%  [kernel.kallsyms]  [k] _raw_spin_lock
       2.31%  nginx              [.] ngx_http_parse_header_line
       1.88%  libjvm.so          [.] 0x00000000007c41b3
       1.40%  [kernel.kallsyms]  [k] tcp_sendmsg_locked
Three columns. Overhead is the fraction of samples that landed in this symbol — not a measured duration, a share of the histogram. The percentages are relative to the samples collected, so 14.21% means "of everything every CPU was doing during this window, one seventh of the caught moments were inside SpinPause." Shared Object is which binary or library the instruction belonged to, and it carries more diagnostic weight than it looks like it should. Symbol is the function name, when perf can resolve one.
The shared object column is a sorting hat. [kernel.kallsyms] means the sample
landed in kernel code — the [k] tag on the symbol says the same thing — so this
CPU time will show up as system time, not user time, in top. A profile dominated by kernel
symbols says the process is making the kernel work: copying data to and from user space
(copy_user_enhanced_fast_string is the kernel's bulk copy, usually driven by
read/write syscalls), contending on spinlocks, running the network stack. A file like
perf-41327.map is a JIT map: runtimes that compile code at runtime — the JVM,
V8, some Python JITs — can write a map file telling perf which generated code lives at which
address, which is how a Java method name shows up in a profile of native instruction pointers.
And a bare hex address like 0x00000000007c41b3 where a name should be means perf
found code it cannot name: no symbols for that region. One stray row of hex is cosmetic; a
profile that is mostly hex is unusable, and the pitfalls section covers the fixes.
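The sorting-hat reading can be made mechanical. The sketch below assumes rows shaped like the listing above (percentage, shared object, tag, symbol) and sums Overhead by the kind of code each row landed in; the helper name overhead_by_kind and its four categories are illustrative choices, not anything perf itself provides.

```shell
# Hypothetical helper: sum Overhead by the kind of code each row landed in.
# Assumes rows shaped like the listing above: "PCT% OBJECT [TAG] SYMBOL".
overhead_by_kind() {
  awk '
    $1 ~ /%$/ {
      pct = $1 + 0                                   # "14.21%" converts to 14.21
      if ($3 == "[k]")                     kind = "kernel"
      else if ($4 ~ /^0x[0-9a-f]+$/)       kind = "unresolved"
      else if ($2 ~ /^perf-[0-9]+\.map$/)  kind = "jit"
      else                                 kind = "user"
      sum[kind] += pct
    }
    END {
      printf "kernel %.2f\nuser %.2f\njit %.2f\nunresolved %.2f\n",
             sum["kernel"], sum["user"], sum["jit"], sum["unresolved"]
    }
  '
}

overhead_by_kind <<'EOF'
14.21%  libjvm.so          [.] SpinPause
 9.87%  perf-41327.map     [.] Lcom/acme/cache/ShardMap;::lookup
 7.45%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 5.12%  libc.so.6          [.] __memmove_avx_unaligned_erms
 3.96%  [kernel.kallsyms]  [k] _raw_spin_lock
 2.31%  nginx              [.] ngx_http_parse_header_line
 1.88%  libjvm.so          [.] 0x00000000007c41b3
 1.40%  [kernel.kallsyms]  [k] tcp_sendmsg_locked
EOF
```

A large kernel share points at syscalls, locks, or the network stack; a large unresolved share means the symbols need fixing before the profile is worth reading.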
Now the counter view. perf stat does no sampling at all; it programs the CPU's
hardware counters at the start of a command, reads them at the end, and prints totals.
    $ perf stat -- ./report-builder --month 2026-05

     Performance counter stats for './report-builder --month 2026-05':

              4,182.41 msec task-clock        #    0.992 CPUs utilized
        14,208,419,206      cycles            #    3.397 GHz
         6,391,532,118      instructions      #    0.45  insn per cycle
         1,242,816,004      branches          #  297.152 M/sec
            18,420,711      branch-misses     #    1.48% of all branches
           412,396,221      cache-references
           198,123,460      cache-misses      #   48.04% of all cache refs

           4.215042512 seconds time elapsed
The line to read first is insn per cycle, IPC. A modern out-of-order core can retire four or more instructions every cycle when the pipeline is fed; this run managed 0.45. That gap is the story. Low IPC means the core spent most of its cycles stalled, waiting for something, and the cache-misses line at the bottom of the counter block names the something: nearly half of all cache references missed, so the core sat idle while loads crawled out to main memory. This program is not compute-bound, it is memory-bound, and making its arithmetic faster will do nothing. Conversely an IPC of 2 or 3 with low miss rates says the core is genuinely busy executing, and the only way to go faster is to execute less. One number, and it redirects the whole optimisation effort. The systematic version of this reading (start at cycles, ask whether they retired work or stalled, then ask why) is the top-down method, and the machinery being measured, the fetch-decode-execute pipeline these counters watch, is the subject of the instruction cycle.
Branch misses get the same treatment: 1.48% is healthy, while several percent on a hot path
means the core keeps guessing wrong about which way the code goes and throwing away
speculative work. The general rule for perf stat is that the absolute numbers
mean little on their own; the ratios (IPC, miss percentages) and the deltas between two runs
are where the information lives.
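Since the ratios carry the information, it can help to compute them yourself from saved output. This is a minimal sketch, assuming the plain event names shown in the listing above; the helper name stat_ratios is hypothetical, and some systems print variants of the event names that this parser would simply skip.

```shell
# Hypothetical helper: derive IPC and cache-miss ratio from perf stat text.
# Assumes the plain event names shown above (cycles, instructions, ...);
# variants such as cycles:u are ignored by this sketch.
stat_ratios() {
  awk '
    { value = $1; gsub(/,/, "", value); value += 0 }   # strip thousands separators
    $2 == "cycles"           { cycles = value }
    $2 == "instructions"     { insns  = value }
    $2 == "cache-references" { refs   = value }
    $2 == "cache-misses"     { misses = value }
    END {
      if (cycles > 0) printf "IPC %.2f\n", insns / cycles
      if (refs > 0)   printf "cache-miss ratio %.1f%%\n", 100 * misses / refs
    }
  '
}

# The four counter lines from the run above.
stat_ratios <<'EOF'
14,208,419,206 cycles
6,391,532,118 instructions
412,396,221 cache-references
198,123,460 cache-misses
EOF
```

In practice you would pipe a saved perf stat report into it; perf stat writes its report to standard error unless told otherwise, so redirect accordingly.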
Three production scenarios
The hot function in a busy service
Latency on a service crept up over a week of deploys and CPU per request has roughly doubled. Nothing in the metrics says why. This is the canonical record-then-report hunt:
    $ sudo perf record -g -p 41327 -- sleep 30
    [ perf record: Woken up 142 times to write data ]
    [ perf record: Captured and wrote 38.412 MB perf.data (412086 samples) ]
    $ sudo perf report --stdio | head -20
    # Overhead  Command  Shared Object  Symbol
        38.41%  server   server         [.] validate_row
                |
                --- validate_row
                    |--96.2%-- render_table
                    |          build_response
                    |          handle_request
                     --3.8%-- import_batch
With -g the report shows not just that validate_row eats 38% of the
CPU but that 96% of the samples inside it were reached through render_table. That call chain
is the diagnosis: someone made rendering re-validate every row, probably as a defensive check
that was cheap on the test dataset. Without stacks you would know the what but not the from
where, and the from-where is what tells you which code to change. For profiles bigger than a
screen, turn the same data into a flame graph — covered two sections down — where this exact
pattern shows up as one wide tower you can see across the room.
CPU is high but my process looks idle-ish
The host graph says 80% CPU. top says your service is using 35% and nothing else
is using much, and the columns that explain the difference are sy, hi
and si in the header: system time, hardware interrupts, softirqs. Interrupt and softirq
time in particular is kernel work that is not reliably charged to any process, so the per-process list lies by
omission. perf top does not, because it samples whatever the CPU is doing,
including the kernel acting on nobody's behalf in particular:
    $ sudo perf top
    Overhead  Shared Object      Symbol
      11.62%  [kernel.kallsyms]  [k] nft_do_chain
       9.84%  [kernel.kallsyms]  [k] __netif_receive_skb_core
       7.10%  [kernel.kallsyms]  [k] tcp_v4_rcv
       5.93%  [kernel.kallsyms]  [k] _raw_spin_lock
       4.41%  [kernel.kallsyms]  [k] csum_partial
Everything hot is in the kernel and everything hot is the network receive path:
nft_do_chain is nftables firewall evaluation, the rest is packet processing and
checksumming inside softirq context. The machine is spending its CPU shovelling packets
through a fat ruleset before your process ever sees a byte, which is why no process owns the
time. The fix lives in firewall rule structure or NIC offload settings, not in your service —
a conclusion you could not reach from top at any zoom level. This is the move to
remember: when system-level CPU and the per-process accounting disagree,
perf top is the arbiter, and the broader triage tree is in
what's eating my CPU?
Regression hunting with perf stat
A batch job got 40% slower after a refactor that "should not have changed anything."
Wall-clock time tells you it is slower; perf stat tells you what kind of slower.
Run both versions on the same input, same machine, and diff the counters:
    $ perf stat -r 5 -- ./builder-v1 fixtures/may.db
         6,021,394,118      instructions      #    2.41  insn per cycle
            41,202,116      cache-misses      #    9.8% of all cache refs
                 1.044 seconds time elapsed                  ( +-  0.31% )

    $ perf stat -r 5 -- ./builder-v2 fixtures/may.db
         6,180,242,330      instructions      #    0.97  insn per cycle
           302,418,209      cache-misses      #   44.1% of all cache refs
                 1.471 seconds time elapsed                  ( +-  0.42% )
Read the diff before reading the code. Instruction count barely moved — the refactor executes
almost the same work — but IPC fell from 2.41 to 0.97 and cache misses went up seven-fold. The
new code is not doing more, it is doing the same things in a memory-hostile order. That
fingerprint usually means a layout change: an array of structs became an array of pointers,
a contiguous buffer became a linked structure, a hot loop now walks objects scattered across
the heap. The -r 5 runs the command five times and prints means with variance,
which is what makes the comparison trustworthy; single runs of anything are noise. With the
counters pointing at memory layout, the follow-up is
perf record -e cache-misses -g on v2 to find exactly which loop misses.
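Reading the diff comes down to percentage changes between the two runs. The helper below is a small sketch (pct_change is a made-up name) applied to the three pairs of numbers from the two perf stat runs above.

```shell
# Hypothetical helper: percentage change from an old counter value to a new one.
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%+.1f%%\n", 100 * (after - before) / before }'
}

# v1 versus v2, using the numbers from the two runs above.
echo "instructions: $(pct_change 6021394118 6180242330)"
echo "cache-misses: $(pct_change 41202116 302418209)"
echo "elapsed:      $(pct_change 1.044 1.471)"
```

A small change in instructions next to a large change in misses is the memory-layout fingerprint described above; the reverse pattern would point at extra work instead.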
Sampling versus tracing, and how to read a flame graph
It is worth being honest about what a sampling profiler can and cannot see, because the failure mode is silent. A sampler looks at the machine N times a second. Anything that holds the CPU for a meaningful share of the window shows up in proportion to the share it holds. Anything brief and rare slips between samples: a function that runs for 200 microseconds once a second occupies 0.02% of the window, so a 99 Hz sampler catches it in roughly one second out of fifty, even if that 200 microseconds is your entire latency problem. Sampling answers "where does the CPU time go in aggregate." It does not answer "what happened during this one slow request," and it does not see off-CPU time at all: a thread blocked on a lock or a disk read is invisible to a CPU sampler precisely because it is not on the CPU.
Tracing is the opposite trade. strace records every system call a process makes, so nothing slips through, but the recording mechanism stops the process at every call boundary, and on a syscall-heavy workload the slowdown can reach tens of times. perf's sampling overhead is typically low single-digit percent, gentle enough to run against production. So the division of labour: sampling for "where does the time go," run freely; tracing for "show me each event," run with care and ideally not on the hot path. When the question is per-event and the overhead of strace is unacceptable, perf's own tracepoint events split the difference.
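The sampler's blind spot for brief work is plain arithmetic: the expected number of samples per second, on one CPU, is the sampling rate times the fraction of each second the work spends on-CPU. The helper name expected_samples is hypothetical; the 200-microsecond figure is the one from the sampling discussion above.

```shell
# Hypothetical helper: expected samples per second (on one CPU) that land
# inside a piece of work, given the sampling rate in Hz and how many
# microseconds of each second that work spends on-CPU.
expected_samples() {
  awk -v hz="$1" -v busy_us="$2" 'BEGIN { printf "%.4f\n", hz * busy_us / 1000000 }'
}

expected_samples 99 200       # 200 microseconds per second at 99 Hz
expected_samples 4000 200     # the same work at perf's default rate
expected_samples 99 310000    # a function that owns 31% of one CPU
```

Anything whose expected count is far below one per second needs either a much longer capture or an event-driven view rather than a sampler.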
Flame graphs are what call-stack samples become when you stack them up.
perf report shows the same data as an expandable tree, which works until the
profile has hundreds of distinct stacks; the flame graph shows all of them at once. The
classic pipeline is Brendan Gregg's scripts:
perf script | stackcollapse-perf.pl | flamegraph.pl > out.svg, and newer perf
versions can emit one directly. Reading one takes a single rule and one warning.
The rule: width is inclusive time. A box's width is the fraction of samples whose stack contained that function, whether it was executing or merely an ancestor of what was. Wide box at the bottom: everything above it accounts to it. Wide flat-topped box with nothing above: samples landed in that function's own code, and that plateau is your hotspot. The warning: the x-axis is not time. Boxes are sorted alphabetically to make merging deterministic, so left-of does not mean before. People who read flame graphs as timelines invent causality that is not there. Width means amount; position means nothing.
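The width rule can be checked against folded stacks, the intermediate format stackcollapse-perf.pl produces: one line per unique stack, frames joined by semicolons from root to leaf, then a space and the sample count. The sketch below computes one function's inclusive width and its flat-top (self) share; the helper name frame_width and the three sample stacks are invented for illustration.

```shell
# Hypothetical helper: inclusive and self sample counts for one function,
# read from folded stacks ("root;...;leaf COUNT" per line).
frame_width() {
  awk -v fn="$1" '
    {
      count = $NF
      stack = $0
      sub(/ [0-9]+$/, "", stack)            # drop the trailing sample count
      total += count
      n = split(stack, frames, ";")
      hit = 0
      for (i = 1; i <= n; i++) if (frames[i] == fn) hit = 1
      if (hit) inclusive += count           # fn appears anywhere in the stack
      if (frames[n] == fn) self += count    # fn is the leaf: the flat top
    }
    END { printf "%s inclusive=%d self=%d total=%d\n", fn, inclusive, self, total }
  '
}

# Three made-up stacks in folded form.
stacks='handle_request;build_response;render_table;validate_row 96
import_batch;validate_row 4
handle_request;build_response;render_table 20'

printf '%s\n' "$stacks" | frame_width render_table
printf '%s\n' "$stacks" | frame_width validate_row
```

A function whose inclusive count is large but whose self count is near zero is a wide box with towers on top of it; a large self count is the plateau worth optimising.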
How it works underneath
The machinery is worth ten minutes because every confusing perf behaviour traces back to it.
Modern CPUs ship a Performance Monitoring Unit, a set of hardware counters that can be
programmed to count events (cycles elapsed, instructions retired, cache misses at each level,
branch mispredictions) at essentially zero cost to the running code. The counters can also be armed to
overflow: preload one so that it wraps after N more events, and raise an interrupt when it does. Set N to "cycles per
sample at 4000 Hz" and the overflow interrupt becomes the sampling tick: the handler wakes,
records the instruction pointer (and with -g, walks the call stack), resets the
counter, and returns. That is the whole trick. perf stat uses the same counters
in plain counting mode, no interrupts, which is why its overhead is close to nothing.
The kernel exposes all of this through one syscall, perf_event_open(2), which
hands back a file descriptor per event per CPU; the perf tool is a userspace client that opens
these descriptors, mmaps ring buffers for the sample stream, and formats what arrives. This
design is why other tools can do what perf does — profilers and eBPF tooling sit on the same
syscall — and why permissions work the way they do: access is gated by
perf_event_paranoid and capabilities rather than by anything perf-specific.
Samples arrive as raw addresses, and the gap between an address and a name is where profiles
go to die. For kernel addresses, perf reads the kernel's own symbol table via
/proc/kallsyms, which is why kernel rows look clean on a stock system. For user
code, it maps the address through the process's loaded binaries and reads their symbol tables
— which works exactly as well as the binaries' symbols allow. Stripped binary: hex addresses.
Distro package without its debuginfo counterpart: hex addresses or bare offsets. JIT-compiled
code: addresses that belong to no binary at all, unless the runtime writes a
/tmp/perf-PID.map file mapping generated code to names — the JVM needs
-XX:+PreserveFramePointer plus an agent such as perf-map-agent (or async-profiler,
which speaks perf's formats), Node needs --perf-basic-prof.
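Before profiling a managed runtime, it is cheap to check whether the map file is actually there. This is a sketch under the assumption that the runtime writes its map to the /tmp/perf-PID.map path described above; the helper name jit_map_status and its optional directory argument (present only so the check can be pointed elsewhere) are hypothetical.

```shell
# Hypothetical preflight: does a JIT symbol map exist for this PID?
# The second argument overrides the directory and exists only for testing.
jit_map_status() {
  pid=$1 dir=${2:-/tmp}
  map="$dir/perf-$pid.map"
  if [ -s "$map" ]; then
    entries=$(wc -l < "$map")
    echo "JIT map found: $map ($((entries)) entries)"
  else
    echo "no JIT map at $map: JIT frames will show as hex"
  fi
}

# The current shell is not a JIT runtime, so this normally reports no map.
jit_map_status "$$"
```

Run it with the PID of the JVM or Node process you intend to profile; an empty or missing map means the runtime flags or the agent are not in effect yet.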
Call stacks have their own failure mode. The cheap way to walk a stack is to follow frame
pointers: each function saves a register pointing at its caller's frame, and the unwinder
just chases the chain. But compilers treat that register as a free general-purpose register
when told to, and -fomit-frame-pointer has been a common default at
-O2 for years — so the chain is broken before perf arrives, and stacks come out
one or two frames deep ending in nonsense. The fixes, in order of preference: build with
-fno-omit-frame-pointer (several distros have moved their entire package sets to
this, precisely so profilers work); or record with --call-graph dwarf, which
copies a chunk of raw stack into every sample and unwinds it later using DWARF debug info —
accurate, but the data files balloon and deep stacks get truncated; or
--call-graph lbr on Intel hardware, which has the CPU itself record the last
few branches, cheap but short. Containers stack a second problem on top: perf resolves
symbols through paths like /proc/PID/root/..., and a profiler running on the
host must find binaries that live inside the container's mount namespace — modern perf
handles this, older builds quietly print hex. When a container profile looks like garbage,
suspect the namespace before the workload.
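The three unwinding choices map onto one flag. The wrapper below is a sketch: record_stacks and the PERF override (PERF=echo for a dry run) are illustrative names, while --call-graph and its fp, dwarf and lbr modes are the perf record options discussed above.

```shell
# Hypothetical wrapper: pick a stack-unwinding method for a 30-second capture.
# PERF can be overridden (PERF=echo) to print the command instead of running it.
record_stacks() {
  mode=$1 pid=$2
  case $mode in
    fp|dwarf|lbr) ;;
    *) echo "usage: record_stacks fp|dwarf|lbr PID" >&2; return 2 ;;
  esac
  "${PERF:-perf}" record --call-graph "$mode" -p "$pid" -- sleep 30
}

# Dry run of the three options, in the order of preference given above.
PERF=echo record_stacks fp 41327      # binaries built with -fno-omit-frame-pointer
PERF=echo record_stacks dwarf 41327   # no frame pointers, debug info available
PERF=echo record_stacks lbr 41327     # hardware branch records (Intel)
```

If the fp capture shows stacks only one or two frames deep, that is the signal to rebuild with frame pointers or fall back to dwarf.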
Pitfalls
Profiling without symbols and trusting the hex. A profile where the hot rows
are 0x00007f3a91c41b80 is not a profile, it is a shrug. Before recording anything
you intend to act on, check the trifecta: debug symbols installed for your binary and its hot
libraries, frame pointers present if you want stacks (or budget for DWARF unwinding), and a
JIT map if a managed runtime is involved. Five minutes of setup against an afternoon of
squinting at addresses.
Running it unprivileged and misreading the result. The
kernel.perf_event_paranoid sysctl gates what non-root users may observe; at
common defaults an unprivileged perf top sees only your own processes and no
kernel samples, and at paranoid settings it refuses outright. Separately,
kernel.kptr_restrict hides kernel symbol addresses, so even when kernel samples
arrive they show as raw hex or a single unresolvable blob. The trap is the quiet version:
a profile that silently omits all kernel time looks complete and is wrong. For anything
incident-shaped, run perf as root and the question evaporates.
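To avoid the quiet version of this trap, check the setting before trusting an unprivileged profile. The helper below is a sketch that translates a perf_event_paranoid level into plain words; explain_paranoid is a made-up name, levels up to 2 follow the kernel's sysctl documentation, and the branch for 3 and above assumes a distro-patched kernel.

```shell
# Hypothetical preflight: translate kernel.perf_event_paranoid into plain words.
# Levels up to 2 follow the kernel documentation; 3 and above exist only on
# distro-patched kernels, so treat that branch as an assumption.
explain_paranoid() {
  level=${1:-$(cat /proc/sys/kernel/perf_event_paranoid)}
  if   [ "$level" -ge 3 ]; then
    echo "$level: unprivileged perf is refused"
  elif [ "$level" -ge 2 ]; then
    echo "$level: unprivileged users see their own processes, user space only"
  elif [ "$level" -ge 1 ]; then
    echo "$level: unprivileged users see their own processes, kernel samples included"
  else
    echo "$level: unprivileged users may also open system-wide events"
  fi
}

explain_paranoid 2    # pass a level explicitly, or no argument to read the live value
```

Called with no argument on a Linux host it reads the live sysctl, which is the quickest way to learn why a non-root profile has no kernel rows.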
perf inside containers. Three distinct failures compound here. The container
usually lacks the privileges to open perf events (CAP_PERFMON, or
CAP_SYS_ADMIN on older kernels, and the default seccomp profiles of container
runtimes block perf_event_open). The perf binary inside the image may not match
the host kernel version. And symbol resolution crosses mount namespaces, as above. The
pragmatic pattern is to profile from the host: host perf can see container processes as
ordinary PIDs, and one privileged toolbox on the node beats fighting three problems inside
every image.
Sampling bias on short-lived work. A sampler is fair to whatever runs long
enough to be caught, and structurally blind to what does not. A fleet of worker processes that
each live 80 milliseconds will barely register in a per-PID 99 Hz profile even if they collectively
own the machine, because each one exists for about eight ticks and is usually gone before you attach. For workloads like that, raise the
frequency, profile system-wide rather than per-PID so samples land in whichever incarnation is
alive, or switch to an event-driven view (fork/exec tracepoints, or simply
perf stat around the spawning parent). When a profile and the CPU graph disagree,
believe the graph and ask what the profiler cannot see.
Forgetting that perf.data is a file with opinions. perf record
writes perf.data into the current directory and perf report reads
the same path, which is convenient until you record twice and report on the wrong run, or
copy the file to another machine and discover symbol resolution wanted the original binaries
(perf archive exists for exactly that). Name your captures
(-o hot-after-deploy.data) and report with -i.
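One way to make naming a habit is to never type the file name yourself. The sketch below timestamps each capture and prints the matching report command; capture_name, record_named and the PERF override are hypothetical names, while -o and -i are the perf options mentioned above.

```shell
# Hypothetical helpers: give every capture its own timestamped file name.
capture_name() {
  printf '%s-%s.data\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# PERF can be overridden (PERF=echo) for a dry run.
record_named() {
  label=$1 pid=$2
  out=$(capture_name "$label")
  "${PERF:-perf}" record -g -o "$out" -p "$pid" -- sleep 30 &&
    echo "read it back with: perf report -i $out"
}

# Dry run: shows the record command and the report command that pairs with it.
PERF=echo record_named hot-after-deploy 41327
```

The label should say why the capture was taken, since that is the detail you will not remember a week later.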
A drill you can run right now
Everything below is safe on any Linux machine or VM: it reads counters, watches the system for
ten seconds, and profiles a throwaway dd that copies zeroes to nowhere. You will
need perf installed (linux-tools-common plus the kernel-matched linux-tools package on
Ubuntu, linux-perf on Debian, perf on Fedora/Arch) and sudo for the second and third steps.
Step 1 — counters on something tiny. Run perf stat -- ls. The
point is not the listing, it is seeing the counter block on a command so small the numbers are
legible. Find the IPC line and read it as a verdict on the run. Run it again on
perf stat -- ls -R /usr/share and watch the same counters describe a heavier job;
compare the IPC and the cache-miss percentage between the two and notice you can already tell
which one spends more of its life waiting on memory.
Step 2 — ten seconds of the live view. Run sudo perf top and
just watch for ten seconds. On an idle machine the list is sparse and dominated by kernel
housekeeping; that is its own lesson, because you now know what baseline looks like. Open a
browser tab or run anything noisy in another terminal and watch the symbols rearrange. Find
one [k] row and one [.] row and say out loud what the difference is.
Press q to leave.
Step 3 — record, then report. Profile a deliberately CPU-flavoured job and read the result:
    $ sudo perf record -g -- dd if=/dev/zero of=/dev/null count=2000000
    2000000+0 records in
    2000000+0 records out
    1024000000 bytes (1.0 GB) copied, 1.09 s, 943 MB/s
    [ perf record: Captured and wrote 0.482 MB perf.data (4127 samples) ]
    $ sudo perf report --stdio | head -12
    # Overhead  Command  Shared Object      Symbol
        24.30%  dd       [kernel.kallsyms]  [k] __clear_user
        11.84%  dd       [kernel.kallsyms]  [k] entry_SYSCALL_64
         8.92%  dd       [kernel.kallsyms]  [k] syscall_return_via_sysret
         6.45%  dd       [kernel.kallsyms]  [k] vfs_read
         5.07%  dd       libc.so.6          [.] read
Read what the profile says about a program you thought you understood. dd is
"just copying," yet almost every hot symbol is kernel-side: __clear_user is the
kernel manufacturing the zeroes that /dev/zero hands out, and the
syscall-entry/exit symbols are the toll booth paid two million times, once per
read/write pair at the default tiny block size. Now run the same
copy with bs=1M count=1000, record again, and watch the syscall overhead rows
shrink: about the same bytes, one two-thousandth of the crossings. You have just used a profiler to find and
fix a real inefficiency, on a one-line program, with the same three commands you would use on
a production service.
The short version: sudo perf top for "what is this machine
doing right now," sudo perf record -g -p PID -- sleep 30 then
sudo perf report for "where does this service's CPU go," and
perf stat -r 5 -- cmd before and after any change you claim made something faster.
Further reading
- Brendan Gregg — perf examples — the working reference: one page of copy-pasteable invocations for nearly every situation, maintained by the person who built the flame graph.
- The perf wiki — the official tutorial and documentation home for the tool itself.
- perf_event_open(2) — the syscall under everything; the long descriptions of event types and sample formats explain what perf is actually asking the kernel for.
- Flame graphs — the canonical explanation, including the off-CPU and differential variants this page did not cover.