What's eating my CPU?
The alert says CPU is above 85% and has been for ten minutes. That is not a diagnosis; it is barely a symptom. Between that alert and a root cause there is a short, repeatable path: decide whether the box is busy or merely waiting, find the process, find the thread inside the process, find the function inside the thread, and recognise which of four endings you have landed in. This page walks that path the way a senior engineer walks it during a real incident — each step is a command, the output you will actually see, and the decision the output forces.
Step 0 — decide what kind of high
Before you hunt for a process, settle a prior question: is the CPU actually doing work, or is the machine just waiting in a way that makes the graphs look bad? The two have completely different investigations, and the alert cannot tell them apart. Most "high CPU" alerts are really "high load" or "high utilisation" alerts, and load is a count of demand, not a measure of compute.
Start with the cheapest possible reading: load average against core count.
$ uptime
 14:32:09 up 41 days,  3:17,  2 users,  load average: 9.12, 8.77, 6.40
$ nproc
8    ← 9.12 demand on 8 cores: saturated, and the 1-min > 15-min trend says it's getting worse
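The same reading can be scripted when you want it in a runbook rather than in your head. This is a minimal sketch, not a standard tool: `load_report` is a hypothetical helper, and it assumes input in the order `/proc/loadavg` uses (1-, 5-, then 15-minute averages).

```python
def load_report(loadavg_line: str, cores: int) -> dict:
    """Compare the 1-minute load with the core count and the 15-minute load."""
    one, _five, fifteen = (float(x) for x in loadavg_line.split()[:3])
    if one > fifteen:
        trend = "rising"
    elif one < fifteen:
        trend = "falling"
    else:
        trend = "flat"
    return {
        "per_core": round(one / cores, 2),  # above 1.0 means tasks are queueing
        "saturated": one > cores,
        "trend": trend,
    }


# Figures from the uptime and nproc capture above.
print(load_report("9.12 8.77 6.40", cores=8))
```

On a live Linux box you would likely feed it `open("/proc/loadavg").read()` and `os.cpu_count()`. Treat "saturated" here as a prompt to look further, not as a verdict.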
The three numbers are demand averaged over one, five, and fifteen minutes. Read them against
nproc: a load of 9 on 8 cores means that, on average, nine tasks wanted a core
at once and one of them was always queueing. The trend matters as much as the level — 9.12
rising from 6.40 is an event in progress; 9.12 falling from 14 is an event you already
missed. But here is the catch that separates this from a real diagnosis: on Linux, the load
average counts not only tasks running and waiting to run, but also tasks in uninterruptible
sleep — usually waiting on disk I/O. A machine with a dying disk and an idle CPU can post a
load of 40. So a high load average forks the investigation immediately, and one cheap command
settles which branch you are on:
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 9  0      0 982160 211344 601724    0    0     0    24 9112 14233 84 11  4  0  1
10  0      0 981004 211344 601726    0    0     0     0 9376 14890 86 10  4  0  0
 9  1      0 980512 211344 601728    0    0     0   132 9018 14507 85 11  3  1  0
r = runnable tasks, b = blocked on I/O. r ≈ 9, b ≈ 0, wa ≈ 0: the CPU is genuinely busy
Two columns decide everything. r is the run queue: tasks that want CPU right
now. b is tasks blocked in uninterruptible sleep, almost always on storage. If
r is high and b is near zero, with us and
sy eating the CPU columns, the box is compute-bound and this page is your
investigation. If instead you see r at 1, b at 7, and
wa at 60-plus while id would be high if the disk ever answered —
that is iowait. The CPU is not eating anything; it is starving politely while storage holds
everyone up. That is a disk investigation, and continuing down this page would waste twenty
minutes profiling a process that is mostly asleep. The general habit of checking utilisation,
saturation, and errors per resource before committing to a theory is the
USE method, and this step is that
method applied in fifteen seconds. The machinery behind the run queue — why nine runnable
tasks on eight cores means queueing, and how the kernel decides who runs — is the subject of
scheduling.
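The fork between "busy" and "waiting" can be written down as a small classifier. This is a sketch under stated assumptions: `classify_vmstat` is a hypothetical helper, and the numeric thresholds are illustrative choices rather than anything vmstat defines.

```python
def classify_vmstat(header: str, row: str) -> str:
    """Label one vmstat sample as compute-bound, I/O-bound, or unclear."""
    sample = dict(zip(header.split(), (int(v) for v in row.split())))
    busy = sample["us"] + sample["sy"]
    # Thresholds are assumptions for illustration; tune them to your fleet.
    if sample["wa"] >= 20 and sample["b"] > 0 and busy < 50:
        return "iowait: investigate storage"
    if busy >= 80 and sample["b"] <= 1:
        return "compute-bound: continue to step 1"
    return "unclear: take more samples"


HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"

# First row of the capture above.
print(classify_vmstat(HEADER, "9 0 0 982160 211344 601724 0 0 0 24 9112 14233 84 11 4 0 1"))
```

One caveat when feeding it real output: the first row vmstat prints is an average since boot, so classify the later rows.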
wa high, b > 0,
CPU columns low → it is the disk, go investigate storage. Load high, r high,
us+sy high → the CPU is really burning, continue to step 1.
Step 1 — top: which process, and what shape
Now find the consumer. top sorted by CPU answers "which process" in five
seconds, but it answers two more questions in the same screen if you know where to look, so
do not glance at the first row and bail.
$ top -o %CPU
top - 14:33:05 up 41 days,  3:18,  2 users,  load average: 9.12, 8.77, 6.40
Tasks: 241 total,   3 running, 238 sleeping,   0 stopped,   0 zombie
%Cpu(s): 84.6 us, 10.9 sy,  0.0 ni,  3.7 id,  0.1 wa,  0.0 hi,  0.4 si,  0.3 st
MiB Mem :  31954.4 total,   2210.9 free,  21311.2 used,   8432.3 buff/cache

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
31764 deploy    20   0   12.1g   3.4g  41208 R 742.3  10.9 312:44.61 feedsvc
 1290 root      20   0  142936  38104  29412 S   4.3   0.1  88:10.02 nginx
  988 root      20   0 1893220  92140  54372 S   1.7   0.3 412:03.77 containerd
First read: one process, feedsvc, at 742% — process %CPU is per-core, so 742%
on an 8-core box means it owns roughly seven and a half cores. That is your suspect. (If
instead you find no single hot process, just dozens of small ones summing to saturation,
the story is fan-out — a cron storm, a fork bomb, a worker pool sized wrong — and the fix is
about how many processes exist, not what any one of them is doing.)
Second read: press 1. The summary line explodes into one line per core, and the
shape of the heat tells you something the totals hide.
(inside top, press 1)
%Cpu0  : 88.2 us, 10.8 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu1  : 86.0 us, 11.9 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu2  : 89.1 us,  9.4 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu3  : 85.7 us, 12.1 sy,  0.0 ni,  1.7 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
…all eight look like this: the load is systemic, not one runaway thread
Compare that with the other shape you will meet, where seven cores idle while one reads 99%
us. One pegged core means a single hot thread: a single-threaded runtime maxed
out, one stuck worker spinning, an infinite loop in one request handler. A process can never
use more than one core per runnable thread, so "the service is at 100% CPU" on a 16-core
machine sometimes means it is at 100% of the one core it can use while the box is
94% idle — saturated and underutilised at the same time. All cores hot means the work is
spread: a busy thread pool, GC churning on parallel threads, or genuinely too much traffic.
The two shapes lead to different next steps, which is why pressing 1 is not
optional.
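The two shapes are easy to tell apart mechanically once you have per-core busy figures (100 minus the `id` column). A rough sketch follows; `core_shape` and its `hot` and `idle` cut-offs are hypothetical names and assumed values, not part of top.

```python
from typing import Sequence


def core_shape(busy: Sequence[float], hot: float = 90.0, idle: float = 20.0) -> str:
    """Classify per-core busy percentages into the two shapes described above."""
    hot_cores = [b for b in busy if b >= hot]
    idle_cores = [b for b in busy if b <= idle]
    if busy and len(hot_cores) == len(busy):
        return "spread: busy pool, parallel GC, or real traffic"
    if len(hot_cores) == 1 and len(idle_cores) == len(busy) - 1:
        return "single hot thread: one core pegged, the rest idle"
    return "mixed: look at per-thread CPU before deciding"


# The four cores shown in the capture above, as 100 - id.
print(core_shape([99.5, 98.4, 99.0, 98.3]))
# The contrasting shape: one core at 99, seven near idle.
print(core_shape([99.0] + [1.0] * 7))
```

The "mixed" branch is deliberate: a box with three hot cores out of eight is neither story yet, and the thread view in step 2 is what resolves it.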
Third read: the us / sy split in the header. us is
your code running in user space; sy is the kernel running on your behalf —
servicing syscalls, copying buffers, handling page faults. A profile of 85/11 like the one
above says the burn is in application code, so the path ahead is a profiler. If the ratio is
inverted — 20 us, 70 sy — your code is barely computing; it is
asking the kernel to do something at an absurd rate, and the path ahead is syscall counting
(step 4). And keep one eye on st, steal time: CPU your VM wanted but the
hypervisor gave to someone else. We will come back to it in the endings, because a high
st ends the investigation on the spot.
This page deliberately uses top as a waypoint, not a destination — the full tour
of its fields, the htop comparison, and the keystrokes worth knowing live in
top & htop. For the investigation you
need exactly three things from it: the PID, the core shape, and the us/sy/st split.
%us dominant → step 2, find the
thread. %sy dominant → step 4, count syscalls. %st meaningfully
above zero → skip to the noisy-neighbour ending. No single hot process → look for fan-out
(process count), not a culprit.
Step 2 — which thread inside the process
A process at 742% is not one thing; it is some number of threads, and the next question is
which of them carry the heat. This matters because thread identity is often the diagnosis by
itself. A JVM where the hot threads are named GC Thread#0 through
#7 has a memory problem wearing a CPU costume. A service where one worker thread
out of forty is pegged has a stuck request. A Go binary where the heat spreads evenly across
runtime-managed threads has a workload problem. You get the thread view two ways:
top -H -p PID for an interactive screen, or pidstat -t -p PID 1 for
one-second samples you can paste into the incident channel.
$ top -H -p 41327
Threads: 312 total,   9 running, 303 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI    VIRT    RES S  %CPU  %MEM     TIME+ COMMAND
41402 deploy    20   0   28.4g   6.1g R  97.4  19.6  84:12.33 GC Thread#0
41403 deploy    20   0   28.4g   6.1g R  96.8  19.6  84:01.91 GC Thread#1
41404 deploy    20   0   28.4g   6.1g R  96.1  19.6  83:55.40 GC Thread#2
41405 deploy    20   0   28.4g   6.1g R  95.7  19.6  83:49.07 GC Thread#3
41398 deploy    20   0   28.4g   6.1g S  12.3  19.6   9:44.10 http-nio-8080-e
…the application threads are nearly idle; the collector owns the CPU
In thread view, the PID column shows thread IDs — each thread on Linux has its own TID, and
the COMMAND column shows the thread name the runtime assigned via
/proc/TID/comm. Well-behaved runtimes name their threads, and those names do a
lot of diagnostic work for free: in the capture above, four collector threads at 96% each
while the request threads sit at 12% is a complete story. The service is not slow because it
is doing too much work; it is slow because the heap is under so much allocation pressure that
the collector runs continuously. That is a memory investigation from here on, covered in
Ending 2, and profiling functions would only tell you what you already know.
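If you want the thread names without a live top screen, they can be read straight from procfs. The sketch below assumes a Linux `/proc` layout (`/proc/PID/task/TID/comm`); `thread_names` and `collector_threads` are hypothetical helpers, and the `"GC Thread"` prefix is simply the naming seen in the capture above, not a rule every runtime follows.

```python
import os
from pathlib import Path


def thread_names(pid: int, proc_root: str = "/proc") -> dict:
    """Map each TID of a process to the thread name in its comm file."""
    names = {}
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            names[int(task.name)] = (task / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # the thread exited mid-scan, or the entry is not a TID
    return names


def collector_threads(names: dict, prefix: str = "GC Thread") -> dict:
    """Pick out threads whose name starts with a collector-style prefix."""
    return {tid: name for tid, name in names.items() if name.startswith(prefix)}


# Only runs where a Linux-style /proc is mounted.
if Path("/proc/self/task").exists():
    print(thread_names(os.getpid()))
```

Bear in mind that `comm` is truncated to 15 characters, which is why the request thread in the capture reads `http-nio-8080-e`.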
When you need the numbers in scrollback instead of a live screen, pidstat gives
the same answer with the user/system split per thread:
$ pidstat -t -p 31764 1 5
Average:      UID      TGID       TID    %usr %system    %CPU   CPU  Command
Average:     1001     31764         -   698.2    41.1   739.3     -  feedsvc
Average:     1001         -     31771    91.3     4.8    96.1     2  |__feedsvc
Average:     1001         -     31772    89.7     5.5    95.2     5  |__feedsvc
Average:     1001         -     31773    90.4     5.1    95.5     0  |__feedsvc
…heat spread evenly across worker threads: systemic, profile the code (step 3)
One small bridge worth knowing if the process is a JVM: thread dumps print the TID in hex as
nid=0xa1ba, so printf '%x\n' 41402 turns the TID from
top -H into the key you grep a thread dump for. Thirty seconds of that and you
are reading the exact Java stack of the thread you watched burn.
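The same conversion in script form, in both directions, for when you are matching several TIDs against a dump. The helper names are hypothetical; the `nid=0x…` shape is the one quoted above.

```python
def tid_to_nid(tid: int) -> str:
    """Format a decimal TID the way a JVM thread dump prints it."""
    return f"nid={tid:#x}"


def nid_to_tid(nid: str) -> int:
    """Turn 'nid=0xa1ba' (or a bare '0xa1ba') back into a decimal TID."""
    return int(nid.split("=")[-1], 16)


# The hottest thread from the top -H capture above.
print(tid_to_nid(41402))
print(nid_to_tid("nid=0xa1ba"))
```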
One hot thread → profile that thread (perf accepts -t TID). Heat spread across workers → profile the
whole process. Either way, step 3.
Step 3 — which function: perf top
You know the process and the thread. The last level of "where" is the function, and the tool
is perf. For an incident, start with perf top — a live, sampling
view of where cycles are going right now, no recording step, no files:
$ sudo perf top -p 18233
Samples: 96K of event 'cycles', 4000 Hz, Event count (approx.): 41203118821
Overhead  Shared Object   Symbol
  41.7%   libcrypto.so.3  [.] sha256_block_data_order_avx2
  18.2%   authd           [.] verify_token
   9.6%   libc.so.6       [.] __memcmp_avx2_movbe
   4.1%   authd           [.] parse_header
   2.8%   [kernel]        [k] copy_user_enhanced_fast_string
Read it top down. Forty-two percent of all sampled cycles sit inside one SHA-256 routine, and
the next row, token verification, is its likely caller (a flat view shows no call chains, so
confirm with perf record -g). That is not a vague signal: it says this service spends roughly
two-fifths of its CPU hashing, and the question becomes "why are we hashing this
much," which is a code review question, not a systems question. Maybe a cache in front of
verification silently stopped caching. Maybe a client retries in a loop and every retry pays
full verification. The profiler's job ends at naming the function; yours continues into why
that function runs so often. When the live view is too jumpy or you want call chains, record
instead — perf record -g -p PID -- sleep 30, then perf report, or
render it as a flame graph where width is time and the widest tower is your answer. The
longer treatment of sampling profilers, flame graphs, and when to reach for off-CPU analysis
is in profiling.
An honest note before you trust the output, because this is where the demo and the production
box part ways. perf resolves addresses to names using symbols, and three things
routinely break that. Stripped binaries: a C or C++ service built without debug info shows
rows of raw hex addresses; install the matching -dbgsym /
debuginfo package or rebuild with frame pointers before the profile means
anything. JITed runtimes: the JVM, Node, and friends generate code at runtime that
perf has never heard of, so you get one giant anonymous region; they need a
perf-map bridge (async-profiler for the JVM, --perf-basic-prof for Node) — and
frankly, the runtime's own profiler (async-profiler, py-spy for Python) is usually the better
first tool there anyway. Containers: run perf from the host, expect PID
namespaces to renumber everything, expect the container's binaries to live under an overlay
path the symbol loader cannot guess, and on hardened or managed hosts expect
perf_event_paranoid to refuse you entirely. None of this is a reason to skip the
step; it is a reason to know in advance which flavour of friction you will hit, and to fall
back to the runtime profiler when the host fights you.
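Of the three kinds of friction, the `perf_event_paranoid` one is the easiest to check before you start. The sketch below reads the sysctl and paraphrases the levels described in the kernel's sysctl documentation; `describe_paranoid` is a hypothetical helper, and the wording for values above 2 is an assumption, since those come from distribution patches and vary.

```python
from pathlib import Path

PARANOID_FILE = Path("/proc/sys/kernel/perf_event_paranoid")


def describe_paranoid(level: int) -> str:
    """Summarise what an unprivileged user can expect from perf at this level."""
    if level < 0:
        return "unprivileged users may use almost all events"
    if level == 0:
        return "unprivileged users lose raw tracepoint access"
    if level == 1:
        return "unprivileged users also lose system-wide CPU event access"
    if level == 2:
        return "unprivileged users are limited to user-space profiling"
    # Assumption: stricter values are distribution-specific.
    return "stricter distribution setting: expect to need root or CAP_PERFMON"


# Only runs where the sysctl is exposed (Linux with perf events enabled).
if PARANOID_FILE.exists():
    level = int(PARANOID_FILE.read_text())
    print(level, describe_paranoid(level))
```

Running perf under sudo, as the commands on this page do, sidesteps the setting on an ordinary host; a locked-down container or managed node may still refuse, which is when the runtime profiler becomes the fallback.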
Kernel symbols ([k] rows) dominating → you are really in
%sy territory; go to step 4.
Step 4 — when %sy is the story: count the syscalls
Suppose step 1 showed the inverted profile: us at 18, sy at 71.
Your code is barely computing. The kernel is doing the burning, and the kernel only burns on
your behalf when you ask — so something is asking at an enormous rate. The fastest way to see
the shape of the asking is strace -c, which counts syscalls instead of printing
them:
$ sudo timeout 10 strace -c -f -p 52214
strace: Process 52214 attached with 9 threads
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.18    6.413442           4   1483920           read
 24.61    2.538710           5    470122    461020 futex
  8.95    0.923301           3    301244           epoll_wait
  2.40    0.247118           2    101831           write
  1.86    0.191808           4     44360           recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00   10.314379               2401477    461020 total
1.48 million reads in ten seconds (148k/s) at ~4 µs each: a tiny buffer in a tight loop
Two patterns account for most real cases, and both are visible in a table like this. The
first is the tight read loop: 1.48 million read calls in ten seconds means
something is reading a few bytes at a time — a buffer size of 1 (a classic misconfigured
client), an unbuffered logger, a poll loop that re-reads a file descriptor it should have
waited on. The fix is almost always buffering or batching, and the win is often 10x. The
second is the futex storm, and the errors column gives it away: 461k of 470k
futex calls failing (returning EAGAIN) means threads are hammering
a contended lock, waking, finding it still held, and going around again. That is lock
contention burning CPU as system time, and the cure lives in the application's locking
design, not in the kernel.
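Both patterns reduce to two numbers: calls per second, and the share of calls that failed. A minimal sketch of pulling those out of a `strace -c` summary follows. The parser is hypothetical and assumes the column order shown above, with the errors column present only on rows that had failures.

```python
SUMMARY = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.18    6.413442           4   1483920           read
 24.61    2.538710           5    470122    461020 futex
  8.95    0.923301           3    301244           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00   10.314379               2401477    461020 total
"""


def parse_strace_summary(text: str) -> dict:
    """Return {syscall: (calls, errors)} from a strace -c table."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        is_data = len(parts) in (5, 6) and parts[0].replace(".", "", 1).isdigit()
        if is_data and parts[-1] != "total":
            errors = int(parts[4]) if len(parts) == 6 else 0
            rows[parts[-1]] = (int(parts[3]), errors)
    return rows


def calls_per_second(rows: dict, name: str, seconds: float) -> float:
    return rows[name][0] / seconds


def error_ratio(rows: dict, name: str) -> float:
    calls, errors = rows[name]
    return errors / calls if calls else 0.0


rows = parse_strace_summary(SUMMARY)
print(calls_per_second(rows, "read", 10))
print(round(error_ratio(rows, "futex"), 2))
```

For the capture above this gives about 148k reads per second and a futex failure share of roughly 0.98, which are the two figures the paragraph reads off by eye.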
One warning that belongs in bold next to this step:
strace is heavy. It stops the target at every syscall entry and
exit, and on a process doing 150k syscalls a second that overhead is not a rounding error —
it can slow the target severalfold and turn your investigation into a second incident. Attach
briefly (timeout 10 as above), use -c rather than the full firehose,
and on anything truly hot prefer perf trace or a BPF tool like
syscount, which count from inside the kernel without stopping anyone. The full
story of what strace shows, what it costs, and how to read the firehose when you
do need it is in strace.
futex with a huge error count →
lock contention; go read the application's synchronisation. Syscall mix looks ordinary →
the %sy may be page-fault or network-softirq driven; check si and fault rates
before blaming the app.
The decision tree, on one screen
Notice what the tree is really doing: every step either narrows "where" by one level (machine → process → thread → function) or reclassifies the problem out of the CPU domain entirely (disk, memory, hypervisor). If a step does neither, you are flailing — go back one level and re-read the output you already have.
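The branching described in steps 0 through 4 can be compressed into one function, which is a useful way to check that you have the order of the questions right. This is a sketch of one possible encoding, not the page's canonical tree: `next_step`, its parameters, and every threshold in it are assumptions chosen for illustration.

```python
def next_step(us: float, sy: float, wa: float, st: float,
              blocked: int, top_process_cpu: float) -> str:
    """Route a set of top/vmstat readings to the next move.

    top_process_cpu is the hottest process's %CPU as top reports it
    (100 = one full core). All thresholds are illustrative assumptions.
    """
    if st >= 5:
        return "ending 3: check steal and CPU credits with the provider"
    if wa >= 20 and blocked > 0 and us + sy < 50:
        return "not CPU: investigate storage"
    if us + sy < 50:
        return "not saturated: re-read the alert before digging"
    if top_process_cpu < 50:
        return "fan-out: count processes, not culprits"
    if sy > us:
        return "step 4: count syscalls"
    return "step 2: find the hot thread"


# Readings from the top capture in step 1.
print(next_step(us=84.6, sy=10.9, wa=0.1, st=0.3, blocked=0, top_process_cpu=742.3))
```

Steal and iowait are tested first because they reclassify the problem out of the CPU domain; only after both are ruled out does it make sense to narrow from machine to process to thread.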
The endings
Nearly every CPU incident resolves to one of four stories. Knowing them in advance changes how you read every step above, because each step is really asking "which ending am I in?"
Ending 1 — the runaway loop
One thread, one function, often traceable to a deploy or a config change within the last few
hours: an unbounded retry with no backoff, a parser that never terminates on one weird input,
a regex that backtracks for geological time on a particular string, a cache loop whose exit
condition a refactor quietly deleted. The profile is unmistakable — one tower in the flame
graph, one pegged core or one saturated pool. The fix is in the code, but the immediate
mitigation is operational: if one stuck worker is poisoning a box, restarting the process
buys you room, and doing that cleanly — which signal, in what order, with what grace period —
is the subject of kill & signals.
Capture the evidence first (a perf record, a thread dump, the offending request
from the logs) because a restart destroys the crime scene.
Ending 2 — GC thrash: memory pressure in a CPU costume
The hot threads belong to the garbage collector; the application threads are nearly idle yet latency is terrible. The CPU graph screams but the CPU is the victim, not the culprit: the heap is so close to full, or the allocation rate so high, that the collector runs back to back, burning cores to reclaim scraps. Tuning CPU here is pointless — the questions are why the heap is under pressure (a leak? a sudden working-set growth? an allocation storm from a bad code path?) and they belong to the memory investigation, which is the next page in this series: what's eating my memory? The tell, one more time, because it saves a wasted hour: collector threads hot, application threads idle. Check thread names before you profile.
Ending 3 — the noisy neighbour
On cloud VMs and oversubscribed hypervisors, %st — steal time — is the share of
time your VM had a runnable task but the hypervisor ran someone else's. A sustained
st of 10 means a tenth of the CPU you are paying for is going to a neighbour,
and no amount of profiling your own process will get it back. The burstable instance variant
is even more common: instance families that run on CPU credits will throttle you hard once
the credits drain, which looks exactly like sustained steal. Either way the investigation is
over on your side of the fence — confirm with the cloud provider's credit and steal metrics,
then move the workload or resize the instance. The trap is spending an hour profiling a
perfectly healthy service because the graphs looked guilty.
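If you want steal as a number over an interval rather than a glance at top, it can be derived from two readings of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the field order documented for procfs (user, nice, system, idle, iowait, irq, softirq, steal); `steal_percent` is a hypothetical helper and the two sample lines are made-up values, not output from any machine in this page.

```python
def steal_percent(before: str, after: str) -> float:
    """Steal share of all CPU time between two /proc/stat 'cpu' lines."""
    # Fields 1..8: user nice system idle iowait irq softirq steal.
    # Guest fields are skipped because guest time is already counted in user/nice.
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Made-up samples taken some seconds apart.
BEFORE = "cpu 1000 0 200 8000 50 0 10 100 0 0"
AFTER = "cpu 1700 0 300 8100 50 0 10 200 0 0"
print(steal_percent(BEFORE, AFTER))
```

A single reading is cumulative since boot, so only the difference between two samples says anything about what is happening now.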
Ending 4 — it's actually fine
Sometimes you walk the whole path and find a service doing exactly what it should: traffic is up 40%, the profile is the normal profile only more so, every core does useful work, latency is within budget. High CPU utilisation is not inherently a problem — for a batch worker or a well-sized service it is the goal; you paid for those cores. The defect is in the alert, which fired on utilisation when the thing anyone cares about is saturation and latency. The honest resolution is to raise the threshold, or better, re-key the alert to run-queue depth or p99 latency, and write down why. An alert that cries wolf trains the on-call to ignore it, and the time that costs you arrives during a real incident.
A worked example, end to end
Here is the whole path run once, compressed the way it actually happens. The pager fires at
09:41: api-02 CPUUtilization > 85% for 10m. SSH in.
$ uptime; nproc
 09:43:11 up 12 days, 22:04,  1 user,  load average: 9.12, 8.77, 6.40
8
$ vmstat 1 3
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 9  0      0 982160 211344 601724    0    0     0    24 9112 14233 84 11  4  0  1
10  0      0 981004 211344 601726    0    0     0     0 9376 14890 86 10  4  0  0
→ r ≈ 9 on 8 cores, b = 0, wa = 0: genuinely compute-bound, and worsening. Continue.
Step 1, find the process. top -o %CPU shows feedsvc, the Go service
that assembles activity feeds, at 742%. Press 1: all eight cores between 85 and
90 us. So: systemic, user-space, one process. Not a stuck single thread, not the
kernel, not steal.
$ pidstat -t -p 31764 1 5 | tail -6
Average:     1001     31764         -   698.2    41.1   739.3     -  feedsvc
Average:     1001         -     31771    91.3     4.8    96.1     2  |__feedsvc
Average:     1001         -     31772    89.7     5.5    95.2     5  |__feedsvc
Average:     1001         -     31773    90.4     5.1    95.5     0  |__feedsvc
→ every worker thread equally hot. Whatever it is, every request pays it. Profile.
$ sudo perf top -p 31764
Samples: 214K of event 'cycles', 4000 Hz
Overhead  Shared Object  Symbol
  28.9%   feedsvc        [.] runtime.mallocgc
  16.7%   feedsvc        [.] runtime.scanobject
  14.0%   feedsvc        [.] regexp/syntax.(*parser).parse
   8.6%   feedsvc        [.] runtime.gcBgMarkWorker.func2
   5.1%   feedsvc        [.] runtime.memmove
   3.9%   feedsvc        [.] regexp.compile
Read the profile as a sentence. More than half the cycles are allocator and collector
(mallocgc, scanobject, the background mark worker) — allocation
pressure. And the biggest non-runtime entry is regex compilation. A service does not
compile regexes 14% of the time unless someone is compiling per request. Check the morning's
deploy diff: a change to feed filtering moved a regexp.MustCompile call from
package init into the per-item filter loop. Every feed item, every request, recompiles the
same pattern — each compile allocating heavily, the collector running constantly to keep up.
The CPU symptom, the allocation cause, and the one-line diff all line up.
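Reading a profile "as a sentence" mostly means adding up rows that belong to the same story. Here is a small sketch that does the addition for the six rows above. The bucket names and the prefix lists are assumptions made for this example, not categories perf itself reports.

```python
PROFILE = """\
28.9% feedsvc [.] runtime.mallocgc
16.7% feedsvc [.] runtime.scanobject
14.0% feedsvc [.] regexp/syntax.(*parser).parse
8.6% feedsvc [.] runtime.gcBgMarkWorker.func2
5.1% feedsvc [.] runtime.memmove
3.9% feedsvc [.] regexp.compile
"""

# Assumed grouping: which symbol prefixes count as allocator/collector work.
GC_PREFIXES = ("runtime.mallocgc", "runtime.scanobject", "runtime.gcBgMarkWorker")


def bucket(symbol: str) -> str:
    if symbol.startswith(GC_PREFIXES):
        return "allocator+gc"
    if symbol.startswith("regexp"):
        return "regexp"
    return "other"


def summarise(profile: str) -> dict:
    """Sum the overhead column per bucket, rounded to one decimal place."""
    totals = {}
    for line in profile.splitlines():
        overhead, _obj, _kind, symbol = line.split(None, 3)
        key = bucket(symbol)
        totals[key] = round(totals.get(key, 0.0) + float(overhead.rstrip("%")), 1)
    return totals


print(summarise(PROFILE))
```

For these rows the allocator and collector bucket comes to 54.2 and the regexp bucket to 17.9, which is the pairing that points at a per-request compile.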
Resolution: hoist the compile back to init, deploy, watch feedsvc drop from 742%
to 9% within a minute of rollout. Total investigation time, about eight minutes — and notice
that not one step required guessing. Each output forced the next move: saturated and
compute-bound → which process → all threads equally → profile → allocator plus regex compile →
the diff. This was ending 1 with a side of ending 2: a runaway-allocation loop whose visible
cost was the collector. If the diff had shown nothing, the next move would have been the
memory page, because a profile dominated by collection with no obvious allocation site
usually means the heap itself is the story.
The fast version: uptime (saturated or not, getting worse or better) ·
vmstat 1 5 (busy vs waiting: r, b, wa, st) · top -o %CPU, press
1 (which process, which cores, us vs sy) · pidstat -t -p PID 1 5
(which thread, and are they GC threads?) · sudo perf top -p PID (which
function). Each one runs in seconds, and after the five you either have the culprit or you
know which specialised investigation to open.
What to write in the incident notes
An investigation that lives only in your terminal scrollback dies with the SSH session. The
write-up does not need to be long, but five things belong in it, in roughly this order.
First, the classification: which branch of the tree this was — compute-bound, iowait
masquerading as load, steal, GC — stated in one line, because the next person to see a
similar alert on this service will pattern-match against it. Second, the evidence chain:
paste the actual outputs — the vmstat lines, the top header, the
pidstat per-thread rows, the top five rows of perf — not prose
descriptions of them. Numbers can be re-examined later when someone doubts the conclusion;
"the CPU was high" cannot. Third, the root cause and the change that introduced it, with the
commit or deploy ID, because "regex compile moved into the loop in deploy 2026-06-08-3" is
actionable in a way that "inefficient code" never is. Fourth, what you did and what you
deliberately did not do — if you chose not to restart because you wanted the profile, say so;
if you killed a process, record the signal and the time so the gap in the service's metrics
has an explanation. Fifth, the follow-ups with owners: the code fix, the alert re-keying if
this was ending 4, the lint rule or review checklist item that would have caught the loop.
A CPU incident where the notes contain a profile, a diff, and a prevention item is an
incident the team only pays for once.
Further reading
- Brendan Gregg — Linux performance analysis in 60,000 milliseconds — the Netflix ten-command checklist this page's "fast version" descends from; worth internalising verbatim.
- Brendan Gregg — Linux load averages: solving the mystery — why Linux counts uninterruptible sleep in the load average, traced to the 1993 patch that did it.
- perf examples — brendangregg.com — the working reference for perf one-liners, flame graph generation, and the symbol-fixing rituals.
- Semicolony — Profiling — the deeper treatment of sampling, flame graphs, and off-CPU analysis that step 3 compresses.