What's eating my CPU?

The alert says CPU is above 85% and has been for ten minutes. That is not a diagnosis; it is barely a symptom. Between that alert and a root cause there is a short, repeatable path: decide whether the box is busy or merely waiting, find the process, find the thread inside the process, find the function inside the thread, and recognise which of four endings you have landed in. This page walks that path the way a senior engineer walks it during a real incident — each step is a command, the output you will actually see, and the decision the output forces.

Step 0 — decide what kind of high

Before you hunt for a process, settle a prior question: is the CPU actually doing work, or is the machine just waiting in a way that makes the graphs look bad? The two have completely different investigations, and the alert cannot tell them apart. Most "high CPU" alerts are really "high load" or "high utilisation" alerts, and load is a count of demand, not a measure of compute.

Start with the cheapest possible reading: load average against core count.

$ uptime
 14:32:09 up 41 days,  3:17,  2 users,  load average: 9.12, 8.77, 6.40
$ nproc
8   ← 9.12 demand on 8 cores: saturated, and the 1-min > 15-min trend says it's getting worse

The three numbers are demand averaged over one, five, and fifteen minutes. Read them against nproc: a load of 9 on 8 cores means that, on average, nine tasks wanted a core at once and one of them was always queueing. The trend matters as much as the level — 9.12 rising from 6.40 is an event in progress; 9.12 falling from 14 is an event you already missed. But here is the catch that separates this from a real diagnosis: on Linux, the load average counts not only tasks running and waiting to run, but also tasks in uninterruptible sleep — usually waiting on disk I/O. A machine with a dying disk and an idle CPU can post a load of 40. So a high load average forks the investigation immediately, and one cheap command settles which branch you are on:

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 9  0      0 982160 211344 601724    0    0     0    24 9112 14233 84 11  4  0  1
10  0      0 981004 211344 601726    0    0     0     0 9376 14890 86 10  4  0  0
 9  1      0 980512 211344 601728    0    0     0   132 9018 14507 85 11  3  1  0
         r = runnable tasks, b = blocked on I/O.  r ≈ 9, b ≈ 0, wa ≈ 0: the CPU is genuinely busy

Two columns decide everything. r is the run queue: tasks that want CPU right now. b is tasks blocked in uninterruptible sleep, almost always on storage. Judge by the later rows, because the first row vmstat prints is an average since boot rather than a live sample. If r is high and b is near zero, with us and sy eating the CPU columns, the box is compute-bound and this page is your investigation. If instead you see r at 1, b at 7, and wa at 60-plus (time that would show up as id if the disk ever answered), that is iowait. The CPU is not eating anything; it is starving politely while storage holds everyone up. That is a disk investigation, and continuing down this page would waste twenty minutes profiling a process that is mostly asleep. The general habit of checking utilisation, saturation, and errors per resource before committing to a theory is the USE method, and this step is that method applied in fifteen seconds. The machinery behind the run queue (why nine runnable tasks on eight cores means queueing, and how the kernel decides who runs) is the subject of scheduling.

The decision. Load high, wa high, b > 0, CPU columns low → it is the disk, go investigate storage. Load high, r high, us+sy high → the CPU is really burning, continue to step 1.
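Step 0 is mechanical enough to script. What follows is a minimal sketch, assuming a POSIX shell and awk; the function names load_per_core and classify_vmstat and every cut-off in them are illustrative assumptions, not part of any standard tool. The sample input is copied from the captures above.

```shell
#!/bin/sh
# load_per_core LOAD CORES: load divided by core count, two decimals.
# Above 1.00 means more tasks wanted a core than there were cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# classify_vmstat CORES: reads vmstat output on stdin and prints
# "iowait", "compute-bound", or "inconclusive".
# The cut-offs (b >= 1, wa >= 20, us+sy >= 70) are illustrative assumptions.
classify_vmstat() {
    awk -v cores="$1" '
        # Header row: remember the position of each named column.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data rows: accumulate the columns the decision depends on.
        ("r" in col) && $1 ~ /^[0-9]+$/ {
            r += $(col["r"]); b += $(col["b"]); wa += $(col["wa"])
            busy += $(col["us"]) + $(col["sy"])
            n++
        }
        END {
            if (n == 0) { print "inconclusive"; exit }
            r /= n; b /= n; wa /= n; busy /= n
            if (b >= 1 && wa >= 20)            print "iowait"
            else if (r >= cores && busy >= 70) print "compute-bound"
            else                               print "inconclusive"
        }'
}

# Sample values copied from the uptime and vmstat captures above.
load_per_core 9.12 8
printf '%s\n' \
    ' r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st' \
    ' 9  0      0 982160 211344 601724    0    0     0    24 9112 14233 84 11  4  0  1' \
    '10  0      0 981004 211344 601726    0    0     0     0 9376 14890 86 10  4  0  0' \
    ' 9  1      0 980512 211344 601728    0    0     0   132 9018 14507 85 11  3  1  0' \
    | classify_vmstat 8
```

On a live box the same functions could be fed with something like `read one rest < /proc/loadavg; load_per_core "$one" "$(nproc)"` and `vmstat 1 5 | classify_vmstat "$(nproc)"`. Columns are looked up by header name rather than position because the column set can differ between vmstat versions.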

Step 1 — top: which process, and what shape

Now find the consumer. top sorted by CPU answers "which process" in five seconds, but it answers two more questions in the same screen if you know where to look, so do not glance at the first row and bail.

$ top -o %CPU
top - 14:33:05 up 41 days,  3:18,  2 users,  load average: 9.12, 8.77, 6.40
Tasks: 241 total,   3 running, 238 sleeping,   0 stopped,   0 zombie
%Cpu(s): 84.6 us, 10.9 sy,  0.0 ni,  3.7 id,  0.1 wa,  0.0 hi,  0.4 si,  0.3 st
MiB Mem :  31954.4 total,   2210.9 free,  21311.2 used,   8432.3 buff/cache

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  31764 deploy    20   0   12.1g   3.4g  41208 R 742.3  10.9 312:44.61 feedsvc
   1290 root      20   0  142936  38104  29412 S   4.3   0.1  88:10.02 nginx
    988 root      20   0 1893220  92140  54372 S   1.7   0.3 412:03.77 containerd

First read: one process, feedsvc, at 742%. Per-process %CPU is scaled so that 100% is one fully used core, so 742% on an 8-core box means it owns roughly seven and a half cores. That is your suspect. (If instead you find no single hot process, just dozens of small ones summing to saturation, the story is fan-out: a cron storm, a fork bomb, a worker pool sized wrong. The fix is then about how many processes exist, not what any one of them is doing.)

Second read: press 1. The summary line explodes into one line per core, and the shape of the heat tells you something the totals hide.

(inside top, press 1)
%Cpu0  : 88.2 us, 10.8 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu1  : 86.0 us, 11.9 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu2  : 89.1 us,  9.4 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu3  : 85.7 us, 12.1 sy,  0.0 ni,  1.7 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
…all eight look like this: the load is systemic, not one runaway thread

Compare that with the other shape you will meet, where seven cores idle while one reads 99% us. One pegged core means a single hot thread: a single-threaded runtime maxed out, one stuck worker spinning, an infinite loop in one request handler. A process can never use more than one core per runnable thread, so "the service is at 100% CPU" on a 16-core machine sometimes means it is at 100% of the one core it can use while the box is 94% idle — saturated and underutilised at the same time. All cores hot means the work is spread: a busy thread pool, GC churning on parallel threads, or genuinely too much traffic. The two shapes lead to different next steps, which is why pressing 1 is not optional.
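The two shapes can also be told apart from captured per-core lines. This is a rough sketch, assuming the "%CpuN : ..." line format shown above; the function name core_shape and the 80% / 20% cut-offs are illustrative assumptions.

```shell
#!/bin/sh
# core_shape: reads per-core "%CpuN : ..." lines (top after pressing 1) on
# stdin and prints "all-cores-hot", "single-hot-core", or "mixed".
# A core counts as hot at 80% busy or more and as idle at 20% busy or less;
# both cut-offs are illustrative assumptions.
core_shape() {
    awk '
        /^%Cpu[0-9]/ {
            gsub(/,/, " ")                      # "0.5 id," becomes "0.5 id "
            for (i = 2; i <= NF; i++)
                if ($i == "id") busy = 100 - $(i - 1)
            cores++
            if (busy >= 80) hot++
            else if (busy <= 20) cold++
        }
        END {
            if (cores == 0)                         print "no per-core lines found"
            else if (hot == cores)                  print "all-cores-hot"
            else if (hot == 1 && cold == cores - 1) print "single-hot-core"
            else                                    print "mixed"
        }'
}

# The four per-core lines shown above.
printf '%s\n' \
    '%Cpu0  : 88.2 us, 10.8 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st' \
    '%Cpu1  : 86.0 us, 11.9 sy,  0.0 ni,  1.6 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st' \
    '%Cpu2  : 89.1 us,  9.4 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st' \
    '%Cpu3  : 85.7 us, 12.1 sy,  0.0 ni,  1.7 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st' \
    | core_shape
```

Busy time is computed as 100 minus id, so iowait and steal count as "busy" here; that is a simplification that is only reasonable once step 0 has ruled those out.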

Third read: the us / sy split in the header. us is your code running in user space; sy is the kernel running on your behalf — servicing syscalls, copying buffers, handling page faults. A profile of 85/11 like the one above says the burn is in application code, so the path ahead is a profiler. If the ratio is inverted — 20 us, 70 sy — your code is barely computing; it is asking the kernel to do something at an absurd rate, and the path ahead is syscall counting (step 4). And keep one eye on st, steal time: CPU your VM wanted but the hypervisor gave to someone else. We will come back to it in the endings, because a high st ends the investigation on the spot.

The %Cpu(s) line as a budget. The us/sy split chooses your next tool; wa says it is not a CPU problem at all; st says it is not even your machine's problem.

This page deliberately uses top as a waypoint, not a destination — the full tour of its fields, the htop comparison, and the keystrokes worth knowing live in top & htop. For the investigation you need exactly three things from it: the PID, the core shape, and the us/sy/st split.

The decision. One hot process, %us dominant → step 2, find the thread. %sy dominant → step 4, count syscalls. %st meaningfully above zero → skip to the noisy-neighbour ending. No single hot process → look for fan-out (process count), not a culprit.

Step 2 — which thread inside the process

A process at 742% is not one thing; it is some number of threads, and the next question is which of them carry the heat. This matters because thread identity is often the diagnosis by itself. A JVM where the hot threads are named GC Thread#0 through #7 has a memory problem wearing a CPU costume. A service where one worker thread out of forty is pegged has a stuck request. A Go binary where the heat spreads evenly across runtime-managed threads has a workload problem. You get the thread view two ways: top -H -p PID for an interactive screen, or pidstat -t -p PID 1 for one-second samples you can paste into the incident channel.

$ top -H -p 41327
Threads: 312 total,   9 running, 303 sleeping,   0 stopped,   0 zombie
    PID USER      PR  NI    VIRT    RES  S  %CPU  %MEM     TIME+ COMMAND
  41402 deploy    20   0   28.4g   6.1g  R  97.4  19.6  84:12.33 GC Thread#0
  41403 deploy    20   0   28.4g   6.1g  R  96.8  19.6  84:01.91 GC Thread#1
  41404 deploy    20   0   28.4g   6.1g  R  96.1  19.6  83:55.40 GC Thread#2
  41405 deploy    20   0   28.4g   6.1g  R  95.7  19.6  83:49.07 GC Thread#3
  41398 deploy    20   0   28.4g   6.1g  S  12.3  19.6   9:44.10 http-nio-8080-e
…the application threads are nearly idle; the collector owns the CPU

In thread view, the PID column shows thread IDs: each thread on Linux has its own TID, and the COMMAND column shows the thread name the runtime assigned, the same string exposed in /proc/PID/task/TID/comm. Well-behaved runtimes name their threads, and those names do a lot of diagnostic work for free: in the capture above, four collector threads at 96% each while the request threads sit at 12% is a complete story. The service is not slow because it is doing too much work; it is slow because the heap is under so much allocation pressure that the collector runs continuously. That is a memory investigation from here on (see Ending 2, GC thrash), and profiling functions would only tell you what you already know.
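When a process has hundreds of threads, summing %CPU by thread-name group makes the "who owns the heat" reading quicker. A minimal sketch, assuming top -H style rows with a header line that names the %CPU and COMMAND columns; the function name cpu_by_thread_group and the rule for stripping a trailing index are illustrative assumptions.

```shell
#!/bin/sh
# cpu_by_thread_group: reads top -H rows on stdin and sums %CPU per thread
# group, where a group is the thread name with a trailing "#N" or "-N"
# index removed. Prints "total<TAB>group", hottest group first.
cpu_by_thread_group() {
    awk '
        # Header row: locate the %CPU and COMMAND columns by name.
        $1 == "PID" {
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") c = i
                if ($i == "COMMAND") m = i
            }
            next
        }
        c && m && $1 ~ /^[0-9]+$/ {
            name = $m
            for (i = m + 1; i <= NF; i++) name = name " " $i   # names may contain spaces
            sub("[#-][0-9]+$", "", name)                        # drop the per-thread index
            sum[name] += $c
        }
        END { for (n in sum) printf "%.1f\t%s\n", sum[n], n }
    ' | sort -rn
}

# Rows copied from the top -H capture above.
printf '%s\n' \
    '    PID USER      PR  NI    VIRT    RES  S  %CPU  %MEM     TIME+ COMMAND' \
    '  41402 deploy    20   0   28.4g   6.1g  R  97.4  19.6  84:12.33 GC Thread#0' \
    '  41403 deploy    20   0   28.4g   6.1g  R  96.8  19.6  84:01.91 GC Thread#1' \
    '  41404 deploy    20   0   28.4g   6.1g  R  96.1  19.6  83:55.40 GC Thread#2' \
    '  41405 deploy    20   0   28.4g   6.1g  R  95.7  19.6  83:49.07 GC Thread#3' \
    '  41398 deploy    20   0   28.4g   6.1g  S  12.3  19.6   9:44.10 http-nio-8080-e' \
    | cpu_by_thread_group
```

For the sample rows the collector group sums to 386.0 against 12.3 for the request thread. A live feed could come from batch-mode top, for example `top -H -b -n 1 -p PID | cpu_by_thread_group`.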

When you need the numbers in scrollback instead of a live screen, pidstat gives the same answer with the user/system split per thread:

$ pidstat -t -p 31764 1 5
Average:      UID      TGID       TID    %usr %system  %CPU   CPU  Command
Average:     1001     31764         -   698.2    41.1 739.3     -  feedsvc
Average:     1001         -     31771    91.3     4.8  96.1     2  |__feedsvc
Average:     1001         -     31772    89.7     5.5  95.2     5  |__feedsvc
Average:     1001         -     31773    90.4     5.1  95.5     0  |__feedsvc
…heat spread evenly across worker threads: systemic, profile the code (step 3)

One small bridge worth knowing if the process is a JVM: thread dumps print the TID in hex as nid=0xa1ba, so printf '%x\n' 41402 turns the TID from top -H into the key you grep a thread dump for. Thirty seconds of that and you are reading the exact Java stack of the thread you watched burn.
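The conversion is a one-liner, wrapped here only so it is easy to reuse. The function name tid_to_nid is an illustrative assumption; the TID is the hottest one from the top -H capture above.

```shell
#!/bin/sh
# tid_to_nid TID: prints the decimal thread ID in the lowercase hex form
# used by the nid= field of a JVM thread dump.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

tid_to_nid 41402    # prints nid=0xa1ba
```

Assuming a thread dump has been saved to a file, `grep "$(tid_to_nid 41402)" dump.txt` would then land on the matching thread's stack; dump.txt is a placeholder name.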

The decision. Hot threads are GC/collector threads → it is memory pressure, jump to the GC ending. One hot thread among many idle ones → profile that thread (perf accepts -t TID). Heat spread across workers → profile the whole process. Either way, step 3.

Step 3 — which function: perf top

You know the process and the thread. The last level of "where" is the function, and the tool is perf. For an incident, start with perf top — a live, sampling view of where cycles are going right now, no recording step, no files:

$ sudo perf top -p 18233
Samples: 96K of event 'cycles', 4000 Hz, Event count (approx.): 41203118821
Overhead  Shared Object        Symbol
  41.7%  libcrypto.so.3       [.] sha256_block_data_order_avx2
  18.2%  authd                [.] verify_token
   9.6%  libc.so.6            [.] __memcmp_avx2_movbe
   4.1%  authd                [.] parse_header
   2.8%  [kernel]             [k] copy_user_enhanced_fast_string

Read it top down. Forty-two percent of all sampled cycles inside one SHA-256 routine, called (per the next row) from token verification, is not a vague signal — it says this service spends close to half its CPU hashing, and the question becomes "why are we hashing this much," which is a code review question, not a systems question. Maybe a cache in front of verification silently stopped caching. Maybe a client retries in a loop and every retry pays full verification. The profiler's job ends at naming the function; yours continues into why that function runs so often. When the live view is too jumpy or you want call chains, record instead — perf record -g -p PID -- sleep 30, then perf report, or render it as a flame graph where width is time and the widest tower is your answer. The longer treatment of sampling profilers, flame graphs, and when to reach for off-CPU analysis is in profiling.

An honest note before you trust the output, because this is where the demo and the production box part ways. perf resolves addresses to names using symbols, and three things routinely break that. Stripped binaries: a C or C++ service built without debug info shows rows of raw hex addresses; install the matching -dbgsym / debuginfo package or rebuild with frame pointers before the profile means anything. JITed runtimes: the JVM, Node, and friends generate code at runtime that perf has never heard of, so you get one giant anonymous region; they need a perf-map bridge (perf-map-agent for the JVM, --perf-basic-prof for Node). Frankly, the runtime's own profiler (async-profiler for the JVM, py-spy for Python) is usually the better first tool there anyway. Containers: run perf from the host, expect PID namespaces to renumber everything, expect the container's binaries to live under an overlay path the symbol loader cannot guess, and on hardened or managed hosts expect perf_event_paranoid to refuse you entirely. None of this is a reason to skip the step; it is a reason to know in advance which flavour of friction you will hit, and to fall back to the runtime profiler when the host fights you.

The decision. A clear hot function → you have your root cause candidate; go match it to a recent change and pick an ending. A wall of hex → fix symbols or switch to the runtime's profiler. Kernel symbols ([k] rows) dominating → you are really in %sy territory; go to step 4.

Step 4 — when %sy is the story: count the syscalls

Suppose step 1 showed the inverted profile: us at 18, sy at 71. Your code is barely computing. The kernel is doing the burning, and the kernel only burns on your behalf when you ask — so something is asking at an enormous rate. The fastest way to see the shape of the asking is strace -c, which counts syscalls instead of printing them:

$ sudo timeout 10 strace -c -f -p 52214
strace: Process 52214 attached with 9 threads
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.18    6.413442           4   1483920           read
 24.61    2.538710           5    470122    461020    futex
  8.95    0.923301           3    301244           epoll_wait
  2.40    0.247118           2    101831           write
  1.86    0.191808           4     44360           recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00   10.314379                2401477    461020 total
1.48 million reads in ten seconds — 148k/s — at ~4 µs each: a tiny buffer in a tight loop

Two patterns account for most real cases, and both are visible in a table like this. The first is the tight read loop: 1.48 million read calls in ten seconds means something is reading a few bytes at a time — a buffer size of 1 (a classic misconfigured client), an unbuffered logger, a poll loop that re-reads a file descriptor it should have waited on. The fix is almost always buffering or batching, and the win is often 10x. The second is the futex storm, and the errors column gives it away: 461k of 470k futex calls failing (returning EAGAIN) means threads are hammering a contended lock, waking, finding it still held, and going around again. That is lock contention burning CPU as system time, and the cure lives in the application's locking design, not in the kernel.

One warning that belongs in bold next to this step: strace is heavy. It stops the target at every syscall entry and exit, and on a process doing 150k syscalls a second that overhead is not a rounding error — it can slow the target severalfold and turn your investigation into a second incident. Attach briefly (timeout 10 as above), use -c rather than the full firehose, and on anything truly hot prefer perf trace or a BPF tool like syscount, which count from inside the kernel without stopping anyone. The full story of what strace shows, what it costs, and how to read the firehose when you do need it is in strace.

The decision. One syscall dominating the count → find the loop that issues it (the fix is usually batching or buffering). futex with a huge error count → lock contention; go read the application's synchronisation. Syscall mix looks ordinary → the %sy may be page-fault or network-softirq driven; check si and fault rates before blaming the app.
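The two tells, call rate and failure ratio, can be pulled out of a strace -c table with a few lines of awk. A minimal sketch, assuming the six-column summary layout shown above and a known capture length in seconds; the function name strace_rates is an illustrative assumption.

```shell
#!/bin/sh
# strace_rates SECONDS: reads a strace -c summary table on stdin and prints,
# per syscall, the call rate and the share of calls that returned an error.
# The errors column is blank for syscalls that never failed, so such rows
# have five fields instead of six.
strace_rates() {
    awk -v secs="$1" '
        $1 ~ /^[0-9.]+$/ && $NF != "total" {
            calls = $4
            errors = (NF == 6) ? $5 : 0
            printf "%s %d/s %.0f%% failed\n", $NF, calls / secs, 100 * errors / calls
        }'
}

# Two rows copied from the table above, captured over ten seconds.
printf '%s\n' \
    ' 62.18    6.413442           4   1483920           read' \
    ' 24.61    2.538710           5    470122    461020    futex' \
    | strace_rates 10
```

For these rows it reports read at 148392 calls per second with no failures and futex at 47012 per second with 98% failing. strace writes its summary to standard error, so a live pipeline would need `2>&1` before the pipe.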

The decision tree, on one screen

The whole investigation as branch points. The green spine is the common path; the dashed boxes are early exits where the answer is already known.

Notice what the tree is really doing: every step either narrows "where" by one level (machine → process → thread → function) or reclassifies the problem out of the CPU domain entirely (disk, memory, hypervisor). If a step does neither, you are flailing — go back one level and re-read the output you already have.
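The machine-level part of the tree (steps 0 and 1) can be written down as a single function. This is a sketch under stated assumptions, not a definitive rule set: the function name next_step and all numeric cut-offs are illustrative, and the thread-level branches from step 2 are deliberately left out.

```shell
#!/bin/sh
# next_step R B CORES US SY WA ST
# Arguments are whole numbers as printed by vmstat (r, b, us, sy, wa, st)
# plus the core count from nproc. All cut-offs are illustrative assumptions.
next_step() {
    r=$1; b=$2; cores=$3; us=$4; sy=$5; wa=$6; st=$7
    busy=$((us + sy))
    if [ "$b" -gt 0 ] && [ "$wa" -ge 20 ]; then
        echo "storage: tasks are blocked on I/O, investigate the disk"
    elif [ "$st" -ge 5 ]; then
        echo "ending 3: steal time, check the hypervisor and CPU credits"
    elif [ "$r" -lt "$cores" ] && [ "$busy" -lt 70 ]; then
        echo "not saturated machine-wide: press 1 in top and look for one pegged core"
    elif [ "$sy" -gt "$us" ]; then
        echo "step 4: count syscalls"
    else
        echo "step 2: find the hot thread, then profile it in step 3"
    fi
}

next_step 9 0 8 84 11 0 1    # the vmstat row from step 0
next_step 9 0 8 18 71 0 0    # the inverted us/sy profile from step 4
```

The order of the checks mirrors the tree: reclassify out of the CPU domain first (disk, then hypervisor), and only then choose between the syscall path and the profiler path.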

The endings

Nearly every CPU incident resolves to one of four stories. Knowing them in advance changes how you read every step above, because each step is really asking "which ending am I in?"

Ending 1 — the runaway loop

One thread, one function, often traceable to a deploy or a config change within the last few hours: an unbounded retry with no backoff, a parser that never terminates on one weird input, a regex that backtracks for geological time on a particular string, a cache loop whose exit condition a refactor quietly deleted. The profile is unmistakable — one tower in the flame graph, one pegged core or one saturated pool. The fix is in the code, but the immediate mitigation is operational: if one stuck worker is poisoning a box, restarting the process buys you room, and doing that cleanly — which signal, in what order, with what grace period — is the subject of kill & signals. Capture the evidence first (a perf record, a thread dump, the offending request from the logs) because a restart destroys the crime scene.

Ending 2 — GC thrash: memory pressure in a CPU costume

The hot threads belong to the garbage collector; the application threads are nearly idle yet latency is terrible. The CPU graph screams but the CPU is the victim, not the culprit: the heap is so close to full, or the allocation rate so high, that the collector runs back to back, burning cores to reclaim scraps. Tuning CPU here is pointless — the questions are why the heap is under pressure (a leak? a sudden working-set growth? an allocation storm from a bad code path?) and they belong to the memory investigation, which is the next page in this series: what's eating my memory? The tell, one more time, because it saves a wasted hour: collector threads hot, application threads idle. Check thread names before you profile.

Ending 3 — the noisy neighbour

On cloud VMs and oversubscribed hypervisors, %st — steal time — is the share of time your VM had a runnable task but the hypervisor ran someone else's. A sustained st of 10 means a tenth of the CPU you are paying for is going to a neighbour, and no amount of profiling your own process will get it back. The burstable instance variant is even more common: instance families that run on CPU credits will throttle you hard once the credits drain, which looks exactly like sustained steal. Either way the investigation is over on your side of the fence — confirm with the cloud provider's credit and steal metrics, then move the workload or resize the instance. The trap is spending an hour profiling a perfectly healthy service because the graphs looked guilty.
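Steal can also be measured directly from two samples of the aggregate cpu line in /proc/stat, where it is the eighth counter after the label. A minimal sketch; the function name steal_pct and the sample counter values are illustrative assumptions.

```shell
#!/bin/sh
# steal_pct BEFORE AFTER
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Counters after the label: user nice system idle iowait irq softirq steal,
# followed by guest counters that are already included in user time.
# Prints steal as a percentage of all CPU time between the two samples.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i    # user through steal
            steal = $9
            if (NR == 1) { t0 = total; s0 = steal }
            else if (total > t0) printf "%.1f\n", 100 * (steal - s0) / (total - t0)
            else print "0.0"
        }'
}

# Hypothetical counters: 100 of the 1060 elapsed ticks were stolen.
steal_pct 'cpu 1000 0 200 8000 50 0 10 100 0 0' \
          'cpu 1800 0 300 8050 50 0 20 200 0 0'
```

On a live VM the two arguments could be taken a few seconds apart, for example `a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat); steal_pct "$a" "$b"`.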

Ending 4 — it's actually fine

Sometimes you walk the whole path and find a service doing exactly what it should: traffic is up 40%, the profile is the normal profile only more so, every core does useful work, latency is within budget. High CPU utilisation is not inherently a problem — for a batch worker or a well-sized service it is the goal; you paid for those cores. The defect is in the alert, which fired on utilisation when the thing anyone cares about is saturation and latency. The honest resolution is to raise the threshold, or better, re-key the alert to run-queue depth or p99 latency, and write down why. An alert that cries wolf trains the on-call to ignore it, and the time that costs you arrives during a real incident.

A worked example, end to end

Here is the whole path run once, compressed the way it actually happens. The pager fires at 09:41: api-02 CPUUtilization > 85% for 10m. SSH in.

$ uptime; nproc
 09:43:11 up 12 days, 22:04,  1 user,  load average: 9.12, 8.77, 6.40
8
$ vmstat 1 3
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 9  0      0 982160 211344 601724    0    0     0    24 9112 14233 84 11  4  0  1
10  0      0 981004 211344 601726    0    0     0     0 9376 14890 86 10  4  0  0
  → r ≈ 9 on 8 cores, b = 0, wa = 0: genuinely compute-bound, and worsening. Continue.

Step 1, find the process. top -o %CPU shows feedsvc, the Go service that assembles activity feeds, at 742%. Press 1: all eight cores between 85 and 90 us. So: systemic, user-space, one process. Not a stuck single thread, not the kernel, not steal.

$ pidstat -t -p 31764 1 5 | tail -6
Average:     1001     31764         -   698.2    41.1 739.3    -  feedsvc
Average:     1001         -     31771    91.3     4.8  96.1    2  |__feedsvc
Average:     1001         -     31772    89.7     5.5  95.2    5  |__feedsvc
Average:     1001         -     31773    90.4     5.1  95.5    0  |__feedsvc
  → every worker thread equally hot. Whatever it is, every request pays it. Profile.

$ sudo perf top -p 31764
Samples: 214K of event 'cycles', 4000 Hz
Overhead  Shared Object   Symbol
  28.9%  feedsvc         [.] runtime.mallocgc
  16.7%  feedsvc         [.] runtime.scanobject
  14.0%  feedsvc         [.] regexp/syntax.(*parser).parse
   8.6%  feedsvc         [.] runtime.gcBgMarkWorker.func2
   5.1%  feedsvc         [.] runtime.memmove
   3.9%  feedsvc         [.] regexp.compile

Read the profile as a sentence. More than half the cycles (28.9 + 16.7 + 8.6 = 54.2%) are allocator and collector work (mallocgc, scanobject, the background mark worker): allocation pressure. And the biggest non-runtime entry is regex compilation, at 14.0% for the parser plus 3.9% for regexp.compile. A service does not spend 18% of its cycles compiling regexes unless someone is compiling per request. Check the morning's deploy diff: a change to feed filtering moved a regexp.MustCompile call from package init into the per-item filter loop. Every feed item, every request, recompiles the same pattern, each compile allocating heavily and the collector running constantly to keep up. The CPU symptom, the allocation cause, and the one-line diff all line up.
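A quick way to read a profile as categories rather than rows is to sum the overhead of the symbols matching a pattern. A rough sketch, assuming rows in the "NN.N%  object  [.] symbol" layout shown above; the function name profile_share and the two patterns are illustrative assumptions.

```shell
#!/bin/sh
# profile_share PATTERN: reads perf-style rows on stdin and prints the summed
# overhead of the rows whose symbol (the last field) matches the extended
# regular expression PATTERN.
profile_share() {
    awk -v pat="$1" '
        $1 ~ /^[0-9.]+%$/ && $NF ~ pat { sum += $1 + 0 }   # "28.9%" + 0 gives 28.9
        END { printf "%.1f\n", sum }'
}

# The six rows of the feedsvc profile above.
rows='  28.9%  feedsvc         [.] runtime.mallocgc
  16.7%  feedsvc         [.] runtime.scanobject
  14.0%  feedsvc         [.] regexp/syntax.(*parser).parse
   8.6%  feedsvc         [.] runtime.gcBgMarkWorker.func2
   5.1%  feedsvc         [.] runtime.memmove
   3.9%  feedsvc         [.] regexp.compile'

# Allocator and collector symbols, then everything under regexp.
printf '%s\n' "$rows" | profile_share '^runtime[.](mallocgc|scanobject|gcBgMarkWorker)'
printf '%s\n' "$rows" | profile_share '^regexp'
```

For this profile the two sums come to 54.2 and 17.9, which is the allocation-pressure share and the regex-compilation share respectively.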

Resolution: hoist the compile back to init, deploy, watch feedsvc drop from 742% to 9% within a minute of rollout. Total investigation time, about eight minutes — and notice that not one step required guessing. Each output forced the next move: saturated and compute-bound → which process → all threads equally → profile → allocator plus regex compile → the diff. This was ending 1 with a side of ending 2: a runaway-allocation loop whose visible cost was the collector. If the diff had shown nothing, the next move would have been the memory page, because a profile dominated by collection with no obvious allocation site usually means the heap itself is the story.

The fast version. When you have two minutes, not twenty, this is the five-command path: uptime (saturated or not, getting worse or better) · vmstat 1 5 (busy vs waiting: r, b, wa, st) · top -o %CPU, press 1 (which process, which cores, us vs sy) · pidstat -t -p PID 1 5 (which thread — and are they GC threads?) · sudo perf top -p PID (which function). Each one runs in seconds, and after the five you either have the culprit or you know which specialised investigation to open.

What to write in the incident notes

An investigation that lives only in your terminal scrollback dies with the SSH session. The write-up does not need to be long, but five things belong in it, in roughly this order. First, the classification: which branch of the tree this was — compute-bound, iowait masquerading as load, steal, GC — stated in one line, because the next person to see a similar alert on this service will pattern-match against it. Second, the evidence chain: paste the actual outputs — the vmstat lines, the top header, the pidstat per-thread rows, the top five rows of perf — not prose descriptions of them. Numbers can be re-examined later when someone doubts the conclusion; "the CPU was high" cannot. Third, the root cause and the change that introduced it, with the commit or deploy ID, because "regex compile moved into the loop in deploy 2026-06-08-3" is actionable in a way that "inefficient code" never is. Fourth, what you did and what you deliberately did not do — if you chose not to restart because you wanted the profile, say so; if you killed a process, record the signal and the time so the gap in the service's metrics has an explanation. Fifth, the follow-ups with owners: the code fix, the alert re-keying if this was ending 4, the lint rule or review checklist item that would have caught the loop. A CPU incident where the notes contain a profile, a diff, and a prevention item is an incident the team only pays for once.

Further reading

14 — What's eating my memory?