The box is slow
Someone says it, and that is all they say. Not "CPU is high", not "the disk is full" — just slow. The pages before this one each chase a named suspect; this one is for the minute before you have a suspect at all. Eight commands, run in order, each answering one question about one resource, and by the end of them you know which of four investigations to open. It is the first sixty seconds of every incident, and it is the same sixty seconds every time, which is exactly why it works.
What this minute is for
"Slow" is the least specific complaint a system can produce, and the natural reaction to it is the worst one: jumping straight to the tool you know best. The engineer who loves profilers profiles. The engineer who suspects the database reads slow-query logs. Both might be right, but neither has earned the right to be, because neither has checked whether the box is even the problem. The discipline this page teaches is to spend one minute being systematic before spending an hour being clever.
The minute has a shape, and the shape comes from the USE method: for every resource, check utilisation (how busy is it), saturation (is work queueing behind it), and errors (is it failing outright). A server has four resources that matter at triage time — CPU, memory, disk, network — so the whole exercise is twelve questions, and it turns out eight commands answer all twelve. The ordering and most of the commands descend from Brendan Gregg's sixty-second checklist, the one the Netflix performance team published and half the industry quietly adopted; the further-reading section points at the original, and you should read it. What this page adds is the routing: not just what each command prints, but which output sends you down which path, and which page in this codex is waiting at the end of each path.
One ground rule before the first command. During these sixty seconds you are not fixing anything, and you are not forming theories. You are taking readings. The moment you let yourself think "it's probably the database again", you start reading the output for confirmation instead of information, and the one time it is not the database you will burn an hour proving it to yourself. Run the eight, read the eight, then theorise.
Step 1 — uptime: is the box loaded at all, and which way is it going
The cheapest reading first. One command, one line, three numbers.
$ uptime
 10:07:14 up 87 days, 14:02,  3 users,  load average: 11.40, 7.12, 3.88
$ nproc
8
← 11.4 demand on 8 cores, and 11.4 > 7.1 > 3.9: saturated and getting worse fast
The three numbers are demand averaged over one, five, and fifteen minutes, and they mean
nothing until you divide by core count. A load of 11 on a 96-core machine is a quiet
afternoon; on 8 cores it means three tasks are queueing at any moment. The comparison
against nproc gives you utilisation-and-saturation in one glance, which is
why this is the first command and not top.
The second read is the trend, and it is often worth more than the level. Three numbers at three time scales make a tiny chart: 11.4 over 7.1 over 3.9 says the problem started in the last few minutes and is still building, so whatever you do next, do it quickly. The reverse shape, 3.9 over 7.1 over 11.4, says the storm already passed; the user who reported slowness was telling the truth ten minutes ago, the evidence is evaporating, and your priority shifts from live commands to logs and dashboards before the trail goes cold. Flat and high means a steady state you can investigate calmly. Flat and low means the box is not loaded at all, which is itself a finding: keep running the checklist, because "slow" with no load usually ends at the network or the application, but you have already ruled out the most common story.
One caveat to carry through everything that follows: on Linux, the load average counts tasks in uninterruptible sleep as well as tasks that want CPU. A machine waiting on a dying disk can post a load of 40 with an idle processor. So a high number here does not yet tell you which resource is in trouble, only that something is. That is fine. This step's job is "is there a fire and is it growing", not "where".
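The two reads described above, level against core count and trend across the time windows, can be sketched in a few lines of Python. This is a minimal illustration rather than part of the checklist: the function name `read_load` and the 20% tolerance used to call a trend flat are assumptions made for the example.

```python
def read_load(load1, load5, load15, cores, tolerance=0.2):
    """Classify the uptime load averages: level per core, then trend.

    `tolerance` is an assumed threshold: the 1-minute figure must differ from
    the 15-minute figure by more than this fraction to count as a trend.
    `load5` is accepted so the three uptime numbers can be passed through
    unchanged; this sketch compares only the two endpoints.
    """
    per_core = load1 / cores
    # Demand beyond what the cores can run at once. On Linux this can include
    # tasks blocked on disk, so it is "something is queueing", not "CPU is".
    queued = max(0.0, load1 - cores)
    level = "loaded" if per_core >= 1.0 else "not loaded"

    if load1 > load15 * (1 + tolerance):
        trend = "rising"      # recent demand is above the older average
    elif load1 < load15 * (1 - tolerance):
        trend = "falling"     # the storm is passing; preserve evidence
    else:
        trend = "flat"
    return {"per_core": per_core, "queued": queued, "level": level, "trend": trend}


# Figures from the uptime example above: 11.40, 7.12, 3.88 on 8 cores.
reading = read_load(11.40, 7.12, 3.88, cores=8)
print(reading["level"], reading["trend"])
```

Comparing only the endpoints keeps the sketch short; a stricter version might also require the 5-minute figure to sit between the other two before calling a trend.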
Load average above nproc → something is saturated; the rest of the checklist finds what. Load low and flat → keep going anyway, but lean network/app. Rising → hurry. Falling → start preserving evidence.
Step 2 — dmesg: is anything screaming
Before measuring anything subtle, check whether the kernel is already shouting the answer. Ten seconds, and it occasionally ends the whole investigation.
$ dmesg -T | tail -20
[Mon Jun 8 09:58:31 2026] TCP: out of memory -- consider tuning tcp_mem
[Mon Jun 8 10:01:12 2026] Out of memory: Killed process 22114 (worker) total-vm:6483920kB, anon-rss:4112340kB
[Mon Jun 8 10:01:12 2026] oom_reaper: reaped process 22114 (worker), now anon-rss:0kB
[Mon Jun 8 10:04:58 2026] blk_update_request: I/O error, dev sdb, sector 488397168
…any of these lines, by itself, reroutes the rest of the hour
You are scanning for a handful of patterns, and each one is a finished diagnosis wearing a log line. An OOM kill means the kernel ran out of memory and shot a process; if the victim is your service or something it depends on, "slow" is really "dying and restarting", and you can skip straight to the memory path. I/O errors against a device mean the storage layer is failing, not merely busy, and that is a hardware-or-driver conversation rather than a performance one — remember errors are the E in USE, and they outrank everything else when present. Filesystem remounts to read-only, NIC link flaps, hung-task warnings, TCP memory pressure: all of them are the kernel filing an incident report nobody read.
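The pattern scan can be sketched as a small filter over the tail of the ring buffer. The pattern list below is an assumption made for illustration: the first three expressions match the sample lines above, and the others are common kernel message fragments whose exact wording varies by kernel version and driver. Treat it as a starting point, not a complete catalogue.

```python
import re

# Hypothetical pattern table: label -> regular expression.
# Wording differs across kernels and drivers; extend it for your fleet.
PATTERNS = {
    "tcp_memory": re.compile(r"TCP: out of memory"),
    "oom_kill": re.compile(r"Out of memory: Killed process"),
    "io_error": re.compile(r"I/O error"),
    "readonly_remount": re.compile(r"[Rr]emounting filesystem read-only"),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "link_down": re.compile(r"Link is Down", re.IGNORECASE),
}


def scan_dmesg(lines):
    """Return (label, line) pairs for every line matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
    return hits


# The four sample lines from the dmesg output above.
SAMPLE = """\
[Mon Jun 8 09:58:31 2026] TCP: out of memory -- consider tuning tcp_mem
[Mon Jun 8 10:01:12 2026] Out of memory: Killed process 22114 (worker) total-vm:6483920kB, anon-rss:4112340kB
[Mon Jun 8 10:01:12 2026] oom_reaper: reaped process 22114 (worker), now anon-rss:0kB
[Mon Jun 8 10:04:58 2026] blk_update_request: I/O error, dev sdb, sector 488397168
"""

for label, line in scan_dmesg(SAMPLE.splitlines()):
    print(label, "|", line)
```

A filter like this only flags candidates. The oom_reaper line, for instance, is deliberately not matched because it is a follow-up to the kill, and a human still has to read the tail to judge which hit matters.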
Most days the tail is boring, and boring is information too: it means the slowness is a matter of degree, not of breakage, and the remaining six commands will find it. The wider craft of reading the kernel ring buffer and its journald-flavoured cousin — how far back the buffer goes, what survives a reboot, how to filter by priority — lives in journalctl & dmesg. At triage time you need only the tail and the nerve to actually read it.
Step 3 — vmstat: the whole-box X-ray, and the routing decision
This is the step the rest of the page pivots on. One command shows every resource at once, one line per second, and the columns sort the incident into a lane. Always ask for several samples — the first line is an average since boot and is worthless; what you want is the rhythm of the next four.
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 204800 412116  88204 8201340    0    0   112   356 4211 8120 21  7 70  2  0
 3  0 204800 410988  88204 8201422    0    0    96   412 4577 8455 24  8 66  2  0
 2  1 204800 409732  88208 8201510    0    0   204   388 4302 8218 22  7 69  2  0
r: runnable tasks
b: blocked, almost always on disk
si/so: swap in/out per second
us/sy: CPU in your code / in the kernel
id: idle
wa: idle while disk I/O is outstanding
Read it column by column, because each group answers one of the USE questions for one
resource. r is the run queue: how many tasks want a core right now, the
running ones included. Compare it to core count exactly as you compared load average,
except this number is instantaneous and does not mix in disk waiters, so it is the honest
CPU-saturation reading the load average only gestures at. b is tasks parked
in uninterruptible sleep, which in practice means blocked on storage; any persistent
nonzero value here is a disk queue forming. si and so are pages
swapped in and out per second, and they are the single most alarming pair on the screen:
sustained nonzero values mean the box does not have enough memory for its working set and
is paying disk prices for memory access. Not "swap is used" — old, cold pages sitting in
swap are harmless — but swap traffic, happening now.
Then the CPU block, which is a budget for where each second went. us is your
code, sy is the kernel working on your code's behalf, id is
truly idle, and wa is the strange one: the CPU sitting idle specifically
because disk I/O is outstanding. High wa does not mean the CPU is busy; it
means the CPU is a customer waiting at the storage counter. The full tour of every column,
and of why free output confuses everyone the first time, is in
free & vmstat — at triage time you
need the six highlighted ones.
And that is the routing decision, the one moment in the sixty seconds where you actually
choose a direction. r persistently at or above core count, with
us plus sy eating the budget: the CPU lane, step 4.
si/so sustained above zero: the memory lane, step 5, and it wins
any tie, because a swapping box corrupts every other reading — processes block on swap-in
and inflate b, the disk gets busy with swap traffic, everything looks guilty
at once, and memory is the cause of all of it. wa high with b
nonzero while r stays modest: the disk lane, step 6. And all of it quiet —
run queue short, no swap traffic, CPU mostly idle, wa near zero — is the
fourth lane, the one juniors miss: the box is fine and the slowness lives somewhere this
screen cannot see. Step 7.
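The routing decision can be written down as a short function, which makes the precedence explicit: memory is checked first because it overrides the others. This is a sketch under stated assumptions. The names `parse_vmstat` and `route` are invented for the example, and the numeric thresholds for "wa high" (10) and "us plus sy eating the budget" (70) are placeholders that the text above does not specify.

```python
def parse_vmstat(text):
    """Parse `vmstat N M` output into a list of dicts keyed by column name."""
    rows, header = [], None
    for line in text.strip().splitlines():
        fields = line.split()
        if fields[:2] == ["r", "b"]:
            header = fields                      # the column-name line
        elif header and len(fields) == len(header) and fields[0].isdigit():
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def route(rows, cores, skip_first=True, wa_high=10, busy_high=70):
    """Pick a lane from vmstat samples. Thresholds are assumed, not canonical."""
    # The first vmstat line is an average since boot, so drop it when possible.
    samples = rows[1:] if skip_first and len(rows) > 1 else rows
    if not samples:
        raise ValueError("no vmstat samples to route on")

    def avg(col):
        return sum(s[col] for s in samples) / len(samples)

    # Memory wins any tie: swap traffic in every sample counts as sustained.
    if all(s["si"] > 0 or s["so"] > 0 for s in samples):
        return "memory"
    if avg("r") >= cores and avg("us") + avg("sy") >= busy_high:
        return "cpu"
    if avg("wa") >= wa_high and avg("b") > 0:
        return "disk"
    return "network/app"


# The three sample lines shown in the vmstat output above, on an 8-core box.
QUIET = """
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 204800 412116  88204 8201340    0    0   112   356 4211 8120 21  7 70  2  0
 3  0 204800 410988  88204 8201422    0    0    96   412 4577 8455 24  8 66  2  0
 2  1 204800 409732  88208 8201510    0    0   204   388 4302 8218 22  7 69  2  0
"""
print(route(parse_vmstat(QUIET), cores=8))
```

The order of the `if` statements is the design choice worth keeping even if every threshold changes: a swapping box makes the CPU and disk readings look guilty too, so those checks only run once memory has been cleared.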
r high → CPU path. si/so nonzero → memory path, and it overrides the others. wa/b up → disk path. All quiet → network or the application itself.
Step 4 — the CPU path: pidstat names the consumer
If vmstat routed you here, the box is compute-bound and the question collapses to "who".
pidstat at one-second intervals gives a rolling per-process answer that, unlike
a screenshot of top, pastes cleanly into the incident channel with its own
averages at the bottom.
$ pidstat 1 5
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1001     31764   612.4    38.2    0.00    14.1   650.6     -  feedsvc
Average:        0      1290     3.1     1.2    0.00     0.4     4.3     -  nginx
Average:      998      2114     1.8     0.9    0.00     0.2     2.7     -  postgres
one process owns six and a half of eight cores; %wait > 0 says it wants even more
Two readings and you are done here. First, is there a single dominant consumer or a crowd?
One process at 650% is a suspect with a name; forty processes at 15% each is a fan-out
problem — a cron storm, an oversized worker pool, a container limit that stopped limiting
— and the fix concerns how many of them exist, not what any one is doing. Second, the
%usr versus %system split: code versus kernel-on-behalf-of-code,
which decides whether the deeper investigation reaches for a profiler or a syscall
counter.
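Both readings reduce to simple arithmetic on the pidstat averages. The sketch below is illustrative only: the function name `cpu_shape` and the 50% share used to separate "single consumer" from "crowd" are assumptions, and the rows are typed in by hand rather than parsed from real pidstat output.

```python
def cpu_shape(procs, dominant_share=0.5):
    """Summarise per-process CPU rows as (top command, shape, side).

    procs: list of (command, pct_usr, pct_system) tuples.
    dominant_share: assumed cut-off; the top process must hold at least this
    fraction of all listed CPU time to count as a single dominant consumer.
    """
    total = sum(usr + system for _, usr, system in procs)
    name, usr, system = max(procs, key=lambda p: p[1] + p[2])
    share = (usr + system) / total if total else 0.0

    shape = "single consumer" if share >= dominant_share else "crowd"
    # Mostly %usr points at the process's own code, mostly %system at the kernel.
    side = "user code" if usr >= system else "kernel"
    return name, shape, side


# The three rows from the pidstat averages above.
SAMPLE = [("feedsvc", 612.4, 38.2), ("nginx", 3.1, 1.2), ("postgres", 1.8, 0.9)]
print(cpu_shape(SAMPLE))
```

In the crowd case the name returned is only the largest member of the crowd, which is why the shape matters more than the name: the fix concerns the number of processes, not that one process.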
Either way, the triage is over and the specialist investigation begins. The full CPU walk — busy versus waiting, process to thread to function to syscall, and the four endings every CPU incident resolves to — is what's eating my CPU?, and you arrive at its step 1 with the routing already done.
Step 5 — the memory path: free -h, read correctly
If si/so were moving, confirm the squeeze and size it.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        29Gi       241Mi       1.1Gi       1.8Gi       412Mi
Swap:          2.0Gi       1.9Gi        88Mi
available is the number that matters: 412Mi of headroom on a 31Gi box, swap nearly full
Resist the reflex to read the free column; it is nearly always small on a
healthy machine, because Linux spends idle memory on the page cache and gives it back on
demand. The honest number is available: the kernel's estimate of how much
memory could be handed to a new allocation without swapping. Here it is 412Mi against
31Gi, with swap 95% occupied and — from vmstat — actively churning. That is a box past the
edge: the working set no longer fits, every page the kernel reclaims is a page something
needed, and the cost surfaces as latency everywhere at once, which is exactly why the
complaint arrived as "slow" rather than as anything more specific. A second tell is
buff/cache crushed down to 1.8Gi on a machine this size; the cache is the
first thing reclaim eats, and a skinny cache means file reads that used to be free now hit
the disk, so the memory shortage quietly becomes a disk problem too.
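To size the squeeze as ratios rather than raw figures, the two numbers discussed above can be pulled out of the `free -h` text. A minimal sketch, assuming the binary-suffix format shown in the sample (`Mi`, `Gi`) and the column order of the sample, where `available` is the seventh field of the `Mem:` line; the function names are invented for the example.

```python
UNITS = {"Ti": 2**40, "Gi": 2**30, "Mi": 2**20, "Ki": 2**10, "B": 1}


def to_bytes(text):
    """Convert a `free -h` size such as '412Mi' or '1.8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    return float(text)  # bare number: assumed to already be bytes


def memory_headroom(free_output):
    """Return available/total memory and used/total swap as fractions."""
    total = available = swap_total = swap_used = 0.0
    for line in free_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            total, available = to_bytes(fields[1]), to_bytes(fields[6])
        elif fields[0] == "Swap:":
            swap_total, swap_used = to_bytes(fields[1]), to_bytes(fields[2])
    return {
        "available_fraction": available / total if total else 0.0,
        "swap_used_fraction": swap_used / swap_total if swap_total else 0.0,
    }


# The free -h sample from above.
SAMPLE = """\
               total        used        free      shared  buff/cache   available
Mem:            31Gi        29Gi       241Mi       1.1Gi       1.8Gi       412Mi
Swap:          2.0Gi       1.9Gi        88Mi
"""
print(memory_headroom(SAMPLE))
```

For the sample this gives roughly 1.3% of memory available and 95% of swap occupied. Because `free -h` rounds for display, `free -b` would be the better input where precision matters.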
From here the question is which process grew and why, whether it is a leak or a legitimate
working-set change, and what the OOM killer will do about it if you do nothing. That
investigation, including how to read RSS honestly and what the kill scoring means, is
what's eating my memory? —
go there with the free output in hand.
Step 6 — the disk path: iostat, then the writer's name
If wa was high and b nonzero, the storage layer is the queue.
iostat -xz 1 shows per-device truth; the x gets the extended
columns that matter and the z hides idle devices.
$ iostat -xz 1
Device      r/s    w/s   rkB/s     wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1    38.0  614.0  1520.0  311296.0      1.2     48.7    29.4   99.6
nvme1n1    12.0   22.0   488.0     310.0      0.4      0.9     0.1    3.8
one device pinned at 99.6% with 29 requests queued and writes waiting 48ms each; its neighbour is idle, so this is a single hot device, not a saturated controller
Three columns carry the verdict. %util is the share of time the device had at
least one request in flight — utilisation, the U. A device at 99.6% for seconds at a time
is busy every moment you looked. aqu-sz is the average queue depth —
saturation, the S — and it is the more damning number: 29 requests waiting means arrivals
outrun completions and every new request joins a line. And await (split into
reads and writes on modern sysstat) is what the queue costs: each write here waits 48
milliseconds from issue to completion on a device whose hardware service time is a
fraction of a millisecond. The latency is the queue, not the disk. One honest caveat
before you quote %util in the incident channel: on SSDs and anything
RAID-like, which serve many requests in parallel, 100% util does not mean the device is
out of capacity — it means at least one request was always in flight. Queue depth and
await are the numbers that prove actual suffering.
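The claim that the latency is the queue can be checked with Little's law: the number of requests in the system equals the arrival rate times the time each request spends there. The sketch below applies that to the sample rows and also encodes the advice to judge suffering by queue depth and await rather than %util. The function names and both thresholds (a queue of 4, an await of 10 ms) are assumptions for illustration; sensible values depend on the device.

```python
def parse_iostat(text):
    """Parse `iostat -x` device rows into dicts keyed by column name."""
    rows, header = [], None
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device":
            header = fields
        elif header and len(fields) == len(header):
            row = {"Device": fields[0]}
            row.update({k: float(v) for k, v in zip(header[1:], fields[1:])})
            rows.append(row)
    return rows


def implied_queue_depth(row):
    """Little's law: requests in flight = rate (per second) x await (seconds)."""
    reads = row["r/s"] * row["r_await"] / 1000.0
    writes = row["w/s"] * row["w_await"] / 1000.0
    return reads + writes


def is_suffering(row, queue_high=4.0, await_high_ms=10.0):
    """Assumed rule of thumb: deep queue AND slow completions, ignoring %util."""
    worst_await = max(row["r_await"], row["w_await"])
    return row["aqu-sz"] >= queue_high and worst_await >= await_high_ms


# The two device rows from the iostat sample above.
SAMPLE = """\
Device      r/s    w/s   rkB/s     wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1    38.0  614.0  1520.0  311296.0      1.2     48.7    29.4   99.6
nvme1n1    12.0   22.0   488.0     310.0      0.4      0.9     0.1    3.8
"""
for row in parse_iostat(SAMPLE):
    print(row["Device"], round(implied_queue_depth(row), 1), is_suffering(row))
```

For nvme0n1 the product comes to about 29.9 requests, close to the reported aqu-sz of 29.4, which is the arithmetic behind reading the 48 ms as time spent in line. The parser keys on column names, so it tolerates the wider and differently ordered header that real sysstat versions print.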
A saturated device is half an answer; the other half is who is saturating it. Per-process I/O comes from the same tool that named the CPU consumer:
$ pidstat -d 1 5
Average:   UID    PID  kB_rd/s    kB_wr/s  kB_ccwr/s  iodelay  Command
Average:     0  48211      0.0   297410.2        0.0        2  backup-agent
Average:   998   2114    412.5     1184.7        6.1       64  postgres
the writer writes 297 MB/s and barely waits; the database writes 1 MB/s and queues behind it
Read the iodelay column against the throughput columns and the power
structure of the device is laid bare: the backup agent produces 99% of the writes and
suffers almost none of the delay, while the database, doing a trickle of I/O, eats the
queue the backup built. The victim and the culprit are different processes, which is the
usual shape of a disk incident and the reason "which process is slow" and "which process
caused it" are different questions.
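The culprit and victim reading amounts to two different maxima over the same rows: the most throughput and the most delay. A minimal sketch, with the rows entered by hand from the sample above and the function name `culprit_and_victim` invented for the example:

```python
def culprit_and_victim(rows):
    """Return (culprit, victim, culprit_write_share) from per-process I/O rows.

    rows: list of (command, kB_rd_per_s, kB_wr_per_s, iodelay) tuples.
    The culprit is whoever moves the most data; the victim is whoever
    accumulates the most I/O delay. They are often different processes.
    """
    culprit = max(rows, key=lambda r: r[1] + r[2])
    victim = max(rows, key=lambda r: r[3])
    total_writes = sum(r[2] for r in rows)
    write_share = culprit[2] / total_writes if total_writes else 0.0
    return culprit[0], victim[0], write_share


# The two rows from the pidstat -d averages above.
SAMPLE = [
    ("backup-agent", 0.0, 297410.2, 2),
    ("postgres", 412.5, 1184.7, 64),
]
culprit, victim, share = culprit_and_victim(SAMPLE)
print(culprit, victim, f"{share:.1%}")
```

When both maxima land on the same process, the device is simply being asked for more than it can deliver by one workload, and the conversation shifts from contention to capacity.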
Step 7 — the network and app path: ss -s, then off the box
If every local resource came back clean, the slowness is in flight or in logic. The first
reading is the socket summary, which is to the network what uptime was to the
CPU — one cheap line that says whether anything is obviously off.
$ ss -s
Total: 1412
TCP:   1389 (estab 212, closed 941, orphaned 3, timewait 902)
Transport Total     IP        IPv6
RAW       0         0         0
UDP       12        9         3
TCP       448       430       18
902 sockets in timewait on a service that should hold a few dozen long-lived connections: something is opening and closing a connection per request
You are pattern-matching, not measuring. A timewait population in the hundreds or thousands on a service that should pool its connections means churn — a connection per request, somewhere, paying a setup cost on every call. An orphaned count climbing means closed sockets still draining to a peer that has stopped reading. Established connections far above or below what the service's architecture predicts is a question all by itself. None of these is proof; each is a thread to pull, and the pulling — retransmits, drops, listen queue overflows, and the order in which to suspect them — is the subject of is it the network?, which picks up exactly where this summary leaves off.
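The summary line is regular enough to parse, which helps when the same check should run across a fleet. The sketch below extracts the TCP state counts and applies one heuristic, flagging churn when timewait sockets outnumber established ones. That heuristic, and the names `tcp_summary` and `looks_like_churn`, are assumptions for illustration; what counts as abnormal depends on what the service's architecture predicts.

```python
import re


def tcp_summary(ss_output):
    """Extract the state counts from the 'TCP:' line of `ss -s` output."""
    match = re.search(r"^TCP:\s+(\d+)\s+\((.*?)\)", ss_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'TCP: N (...)' summary line found")
    counts = {"total": int(match.group(1))}
    for part in match.group(2).split(","):
        name, value = part.split()
        # Some ss versions print values such as '902/0'; keep the first number.
        counts[name] = int(value.split("/")[0])
    return counts


def looks_like_churn(counts):
    """Assumed heuristic: more sockets in timewait than established."""
    return counts.get("timewait", 0) > counts.get("estab", 0)


# The first lines of the ss -s sample above.
SAMPLE = """\
Total: 1412
TCP:   1389 (estab 212, closed 941, orphaned 3, timewait 902)
"""
counts = tcp_summary(SAMPLE)
print(counts["estab"], counts["timewait"], looks_like_churn(counts))
```

As with the rest of this step, a positive result is a thread to pull rather than a verdict: a busy proxy can legitimately carry a large timewait population.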
The routing, on one screen
Notice the structure: the spine is unconditional, the lanes are exclusive, and every lane checks the same three things for its resource — how busy, how queued, how broken. If you find yourself running commands that do not fit a lane, you have started theorising; go back to the last reading you trusted.
When nothing is saturated
The fourth ending deserves its own section because it is the one engineers handle worst. You ran the eight. Load is modest, the kernel is calm, the run queue is short, swap is still, the disks are loafing, the sockets look normal. The temptation is to distrust the readings and run them again, louder. Do not. A clean triage is a finding, and a valuable one: the box is fine, so the slowness lives in one of three places this checklist cannot see.
Upstream or downstream of the box, first. The service is slow because something it calls is slow — a database on another host, a third-party API, a DNS resolver taking its time. From this machine's point of view, waiting on a slow dependency is indistinguishable from idleness; the threads sit parked in network reads, consuming nothing. Your latency dashboards, broken down by dependency, answer this faster than any further command here can, and the per-connection forensics live in is it the network?
Inside the application, second. Software is very good at being slow without using any resource the kernel can see. Lock contention: fifty threads serialising politely through one mutex produce near-zero CPU and terrible latency. Pool exhaustion: a connection pool sized at ten under load that needs forty makes every request queue at the pool, invisible to vmstat because the queue is a data structure, not a run queue. A garbage collector pausing the world a hundred times a minute charges its cost as latency, not as sustained utilisation. The tells are application-level: thread dumps full of parked threads, pool metrics pinned at their maximum, GC logs with pause times that match the complaint. A quick look at process states in top & htop helps here — a service that is "slow" while every thread sleeps is confessing that it is waiting on something — and a syscall trace with strace will show threads blocked in futex or read calls, naming the wait directly.
And occasionally: nowhere, because the complaint is wrong. The user's wifi, a stale cache on their side, a dashboard averaging two populations. Before you spend the afternoon, get one number that pins the slowness to a request ID or a timestamp you can find in your own telemetry. "Slow" that cannot produce a single slow request is weather, not a system problem.
A worked example, end to end
The pager fires at 02:13: p99 latency on api-04 tripled. Nothing deployed
since Friday. SSH in and run the spine.
$ uptime; nproc
 02:14:51 up 64 days,  9:12,  1 user,  load average: 13.80, 11.21, 6.02
8
$ dmesg -T | tail
…routine lines only, nothing screaming
$ vmstat 1 5
 r  b   swpd   free   buff   cache   si   so    bi     bo   in   cs us sy id wa st
 1  6 102400 822416  41200 9904188    0    0    88 301244 6120 9412  8  5 46 41  0
 2  7 102400 821980  41200 9904212    0    0   120 298332 6080 9388  9  6 44 41  0
→ load 13.8 but r ≈ 2: the load is not CPU demand. b at 6-7 and wa at 41: half a dozen tasks permanently blocked on disk, and 300 MB/s of writes. Disk lane.
Load of 13.8 on 8 cores looked like a CPU story for exactly as long as it took vmstat to
print. The run queue is nearly empty; the load average is being inflated by the blocked
column, those six or seven tasks in uninterruptible sleep, which Linux counts as load.
No swap traffic, so memory is clear. wa at 41 with bo around
300,000 kB per second routes this to step 6.
$ iostat -xz 1
Device      r/s    w/s   rkB/s     wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1    44.0  598.0  1760.0  306176.0      9.8     52.3    31.2   99.2
$ pidstat -d 1 5
Average:   UID    PID  kB_rd/s    kB_wr/s  kB_ccwr/s  iodelay  Command
Average:     0  61844      0.0   301228.4        0.0        3  backup-agent
Average:   998   2114    388.2     1022.6        4.8       71  postgres
→ one device at 99% with a 31-deep queue and 52ms writes; one process producing the writes. The 02:00 backup is flooding the same NVMe the database lives on.
Two commands past the spine and the story is whole: the nightly backup job started at
02:00, writes 300 MB/s to the same device that serves the database's WAL and reads, the
device queue went to 31, every database fsync now waits 52 milliseconds behind backup
blocks, and the API's p99 — which is mostly database time — tripled. The fix at 02:20 is
operational: pause or throttle the backup (an ionice class, a bandwidth cap in the agent's
config) and watch aqu-sz collapse and p99 follow within a minute. The fix at
10:00 is structural: backups write to a different device or a different window, and the
alert that fired learns to annotate itself with iostat so the next person starts at step
6. Total time from page to mitigation, under ten minutes, and not one step required a
guess — uptime said "saturated", vmstat said "disk, not CPU", iostat named the device,
pidstat named the process, cron named the human.
uptime (loaded? trending?) · dmesg | tail (anything screaming?) · vmstat 1 5 (the router: r, b, si/so, us/sy/id/wa) · then the lane it picks: pidstat 1 (CPU: who?) · free -h (memory: read available) · iostat -xz 1 (disk: %util, await, aqu-sz) · pidstat -d 1 (disk: who's writing?) · ss -s (network: churn, orphans). Sixty seconds, and you exit knowing which specialist page to open, or knowing the box is innocent, which is worth just as much.
What to write in the incident notes
The triage produces a small, specific artifact, and writing it down takes two minutes
while the commands are still in scrollback. First, the route: which lane vmstat picked
and why, in one line — "load 13.8 was b-driven, wa 41, routed to disk" tells the next
responder more than a paragraph of impressions. Second, the raw readings: paste the
actual uptime line, the two or three vmstat samples, the iostat row, the
pidstat rows. Outputs can be re-read when the conclusion is challenged; adjectives
cannot. Third, the culprit and the mechanism, named as specifically as the evidence
allows: not "disk contention" but "backup-agent wrote 300 MB/s to nvme0n1 starting 02:00,
queue depth 31, database fsyncs waited 52ms". Fourth, what you changed and when, to the
minute, so the recovery visible in every dashboard has a label. Fifth, the prevention
item with an owner: move the backup, cap its bandwidth, re-key the alert. And one habit
worth stealing: if the triage came back clean, write that down too, with the readings.
"Checked all four lanes at 02:15, box clean, escalating to the app team" saves the next
responder from re-running your minute and saves you from the suspicion that you skipped
it.
Further reading
- Brendan Gregg — Linux performance analysis in 60,000 milliseconds — the original sixty-second checklist this page descends from; ten commands, annotated, from the Netflix performance team.
- Brendan Gregg — The USE method — the source treatment of utilisation, saturation, errors, with per-resource checklists that go far past the four lanes here.
- Linux performance — brendangregg.com — the tool-map diagram worth keeping open: every observability tool placed on a drawing of the kernel.
- Semicolony — The USE method — this codex's treatment of the method the triage is shaped by, including where it stops working.