
The box is slow

Someone says it, and that is all they say. Not "CPU is high", not "the disk is full" — just slow. The pages before this one each chase a named suspect; this one is for the minute before you have a suspect at all. Eight commands, run in order, each answering one question about one resource, and by the end of them you know which of four investigations to open. It is the first sixty seconds of every incident, and it is the same sixty seconds every time, which is exactly why it works.


What this minute is for

"Slow" is the least specific complaint a system can produce, and the natural reaction to it is the worst one: jumping straight to the tool you know best. The engineer who loves profilers profiles. The engineer who suspects the database reads slow-query logs. Both might be right, but neither has earned the right to be, because neither has checked whether the box is even the problem. The discipline this page teaches is to spend one minute being systematic before spending an hour being clever.

The minute has a shape, and the shape comes from the USE method: for every resource, check utilisation (how busy is it), saturation (is work queueing behind it), and errors (is it failing outright). A server has four resources that matter at triage time — CPU, memory, disk, network — so the whole exercise is twelve questions, and it turns out eight commands answer all twelve. The ordering and most of the commands descend from Brendan Gregg's sixty-second checklist, the one the Netflix performance team published and half the industry quietly adopted; the further-reading section points at the original, and you should read it. What this page adds is the routing: not just what each command prints, but which output sends you down which path, and which page in this codex is waiting at the end of each path.

One ground rule before the first command. During these sixty seconds you are not fixing anything, and you are not forming theories. You are taking readings. The moment you let yourself think "it's probably the database again", you start reading the output for confirmation instead of information, and the one time it is not the database you will burn an hour proving it to yourself. Run the eight, read the eight, then theorise.

Step 1 — uptime: is the box loaded at all, and which way is it going

The cheapest reading first. One command, one line, three numbers.

$ uptime
 10:07:14 up 87 days, 14:02,  3 users,  load average: 11.40, 7.12, 3.88
$ nproc
8   ← 11.4 demand on 8 cores, and 11.4 > 7.1 > 3.9: saturated and getting worse fast

The three numbers are demand averaged over one, five, and fifteen minutes, and they mean nothing until you divide by core count. A load of 11 on a 96-core machine is a quiet afternoon; on 8 cores it means three tasks are queueing at any moment. The comparison against nproc gives you utilisation-and-saturation in one glance, which is why this is the first command and not top.

The second read is the trend, and it is often worth more than the level. Three numbers at three time scales make a tiny chart: 11.4 over 7.1 over 3.9 says the problem started in the last few minutes and is still building, so whatever you do next, do it quickly. The reverse shape, 3.9 over 7.1 over 11.4, says the storm already passed; the user who reported slowness was telling the truth ten minutes ago, the evidence is evaporating, and your priority shifts from live commands to logs and dashboards before the trail goes cold. Flat and high means a steady state you can investigate calmly. Flat and low means the box is not loaded at all, which is itself a finding: keep running the checklist, because "slow" with no load usually ends at the network or the application, but you have already ruled out the most common story.

One caveat to carry through everything that follows: on Linux, the load average counts tasks in uninterruptible sleep as well as tasks that want CPU. A machine waiting on a dying disk can post a load of 40 with an idle processor. So a high number here does not yet tell you which resource is in trouble, only that something is. That is fine. This step's job is "is there a fire and is it growing", not "where".

The decision. Load well above nproc → something is saturated; the rest of the checklist finds what. Load low and flat → keep going anyway, but lean network/app. Rising → hurry. Falling → start preserving evidence.
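The two reads described in this step, level against core count and trend across the three windows, can be sketched as a small shell function. This is an illustration, not part of the original checklist: the name load_verdict and its labels are invented here, any 1-minute load above the core count is called saturated, and the trend test is deliberately crude (strictly rising, strictly falling, or else flat).

```shell
# load_verdict ONE FIVE FIFTEEN CORES
# Hypothetical helper: classifies the level (1-minute load vs core count)
# and the trend (1 > 5 > 15 is rising, 1 < 5 < 15 is falling, else flat).
load_verdict() {
  awk -v a="$1" -v b="$2" -v c="$3" -v n="$4" 'BEGIN {
    level = (a > n) ? "saturated" : "within capacity"
    trend = (a > b && b > c) ? "rising" : ((a < b && b < c) ? "falling" : "flat")
    print level ", " trend
  }'
}

# On a live box the inputs would come from /proc/loadavg and nproc:
#   read -r one five fifteen _ < /proc/loadavg
#   load_verdict "$one" "$five" "$fifteen" "$(nproc)"
```

Fed the readings from the example, load_verdict 11.40 7.12 3.88 8 prints "saturated, rising".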

Step 2 — dmesg: is anything screaming

Before measuring anything subtle, check whether the kernel is already shouting the answer. Ten seconds, and it occasionally ends the whole investigation.

$ dmesg -T | tail -20
[Mon Jun  8 09:58:31 2026] TCP: out of memory -- consider tuning tcp_mem
[Mon Jun  8 10:01:12 2026] Out of memory: Killed process 22114 (worker) total-vm:6483920kB, anon-rss:4112340kB
[Mon Jun  8 10:01:12 2026] oom_reaper: reaped process 22114 (worker), now anon-rss:0kB
[Mon Jun  8 10:04:58 2026] blk_update_request: I/O error, dev sdb, sector 488397168
…any of these lines, by itself, reroutes the rest of the hour

You are scanning for a handful of patterns, and each one is a finished diagnosis wearing a log line. An OOM kill means the kernel ran out of memory and shot a process; if the victim is your service or something it depends on, "slow" is really "dying and restarting", and you can skip straight to the memory path. I/O errors against a device mean the storage layer is failing, not merely busy, and that is a hardware-or-driver conversation rather than a performance one — remember errors are the E in USE, and they outrank everything else when present. Filesystem remounts to read-only, NIC link flaps, hung-task warnings, TCP memory pressure: all of them are the kernel filing an incident report nobody read.

Most days the tail is boring, and boring is information too: it means the slowness is a matter of degree, not of breakage, and the remaining six commands will find it. The wider craft of reading the kernel ring buffer and its journald-flavoured cousin — how far back the buffer goes, what survives a reboot, how to filter by priority — lives in journalctl & dmesg. At triage time you need only the tail and the nerve to actually read it.

The decision. OOM kills → the memory path, step 5, with the cause half known. I/O errors → stop tuning, start checking the device's health. Nothing notable → good, on to the X-ray.
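If reading the tail by eye feels risky at 02:00, a filter can surface the lines worth a second look. A minimal sketch: the function name kernel_red_flags and the pattern list are assumptions made for illustration, the list is far from exhaustive, and kernel message wording varies between versions and drivers, so treat a silent filter as a hint and still read the tail.

```shell
# kernel_red_flags: reads kernel log lines on stdin and prints only those
# matching a starter list of patterns (an assumed, non-exhaustive list).
kernel_red_flags() {
  grep -E -i 'out of memory|oom_reaper|I/O error|read-only|blocked for more than|link is down'
}

# Typical use (dmesg may need root where access is restricted):
#   dmesg -T | tail -50 | kernel_red_flags
```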

Step 3 — vmstat: the whole-box X-ray, and the routing decision

This is the step the rest of the page pivots on. One command shows CPU, memory, and disk at once, one line per second, and the columns sort the incident into a lane; the network lane is the one you reach by elimination, when everything on this screen is quiet. Always ask for several samples: the first line is an average since boot and is worthless; what you want is the rhythm of the next four.

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd   free   buff  cache  si   so    bi    bo   in    cs us sy id wa st
 2  0 204800 412116 88204 8201340    0    0    112   356 4211  8120 21  7 70  2  0
 3  0 204800 410988 88204 8201422    0    0     96   412 4577  8455 24  8 66  2  0
 2  1 204800 409732 88208 8201510    0    0    204   388 4302  8218 22  7 69  2  0
  r: runnable tasks    b: blocked, almost always on disk    si/so: swap in/out per second
  us/sy: CPU in your code / in the kernel    id: idle    wa: idle while disk I/O is outstanding

Read it column by column, because each group answers one of the USE questions for one resource. r is the run queue: how many tasks want a core right now, the running ones included. Compare it to core count exactly as you compared load average, except this number is instantaneous and does not mix in disk waiters, so it is the honest CPU-saturation reading the load average only gestures at. b is tasks parked in uninterruptible sleep, which in practice means blocked on storage; any persistent nonzero value here is a disk queue forming. si and so are pages swapped in and out per second, and they are the single most alarming pair on the screen: sustained nonzero values mean the box does not have enough memory for its working set and is paying disk prices for memory access. Not "swap is used" — old, cold pages sitting in swap are harmless — but swap traffic, happening now.

Then the CPU block, which is a budget for where each second went. us is your code, sy is the kernel working on your code's behalf, id is truly idle, and wa is the strange one: the CPU sitting idle specifically because disk I/O is outstanding. High wa does not mean the CPU is busy; it means the CPU is a customer waiting at the storage counter. The full tour of every column, and of why free output confuses everyone the first time, is in free & vmstat — at triage time you need the six highlighted ones.

[Figure: the vmstat header as a switchboard. r ≥ cores → CPU path, step 4. si/so > 0 → memory path, step 5. wa high, b > 0 → disk path, step 6. Everything ≈ 0 (all quiet) → network / app path, step 7. b and wa both point at disk; si/so outranks everything when nonzero.]
Six columns, four lanes. vmstat does not diagnose anything; it routes you to the tool that will.

And that is the routing decision, the one moment in the sixty seconds where you actually choose a direction. r persistently at or above core count, with us plus sy eating the budget: the CPU lane, step 4. si/so sustained above zero: the memory lane, step 5, and it wins any tie, because a swapping box corrupts every other reading — processes block on swap-in and inflate b, the disk gets busy with swap traffic, everything looks guilty at once, and memory is the cause of all of it. wa high with b nonzero while r stays modest: the disk lane, step 6. And all of it quiet — run queue short, no swap traffic, CPU mostly idle, wa near zero — is the fourth lane, the one juniors miss: the box is fine and the slowness lives somewhere this screen cannot see. Step 7.

The decision. r high → CPU path. si/so nonzero → memory path, and it overrides the others. wa/b up → disk path. All quiet → network or the application itself.
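The routing rule can be written down as a filter over vmstat output. This is a sketch under stated assumptions, not a replacement for reading the columns: the name vmstat_lane is invented, "sustained" swap is interpreted as swap traffic in every sample after the first, "wa high" is taken as a mean of 20 or more, and the column positions assume the default vmstat layout shown above (r, b, si, so and wa in fields 1, 2, 7, 8 and 16).

```shell
# vmstat_lane CORES  (vmstat output on stdin)
# Hypothetical router with illustrative thresholds: swap traffic in every
# sample means memory, mean r >= cores means CPU, mean wa >= 20 with
# blocked tasks means disk, anything else is the quiet lane.
vmstat_lane() {
  awk -v cores="$1" '
    $1 !~ /^[0-9]+$/ { next }   # skip header lines
    ++seen == 1      { next }   # skip the first sample (average since boot)
    {
      n++; r += $1; b += $2; wa += $16
      if ($7 + $8 > 0) swapping++
    }
    END {
      if (n == 0)                         print "no samples"
      else if (swapping == n)             print "memory"
      else if (r / n >= cores)            print "cpu"
      else if (wa / n >= 20 && b / n > 0) print "disk"
      else                                print "quiet"
    }'
}

# Typical use:
#   vmstat 1 5 | vmstat_lane "$(nproc)"
```

The memory check comes first on purpose, mirroring the rule that si/so overrides the other lanes.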

Step 4 — the CPU path: pidstat names the consumer

If vmstat routed you here, the box is compute-bound and the question collapses to "who". pidstat at one-second intervals gives a rolling per-process answer that, unlike a screenshot of top, pastes cleanly into the incident channel with its own averages at the bottom.

$ pidstat 1 5
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1001     31764   612.4    38.2    0.00    14.1   650.6     -  feedsvc
Average:        0      1290     3.1     1.2    0.00     0.4     4.3     -  nginx
Average:      998      2114     1.8     0.9    0.00     0.2     2.7     -  postgres
  one process owns six and a half of eight cores; %wait > 0 says it wants even more

Two readings and you are done here. First, is there a single dominant consumer or a crowd? One process at 650% is a suspect with a name; forty processes at 15% each is a fan-out problem — a cron storm, an oversized worker pool, a container limit that stopped limiting — and the fix concerns how many of them exist, not what any one is doing. Second, the %usr versus %system split: code versus kernel-on-behalf-of-code, which decides whether the deeper investigation reaches for a profiler or a syscall counter.
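The first of those readings, one dominant consumer or a crowd, can be approximated from the Average rows. A minimal sketch: the name cpu_shape and the 50% cut-off are assumptions for illustration, and the columns are located by header name because the set of pidstat columns differs between sysstat versions.

```shell
# cpu_shape  (pidstat output on stdin)
# Hypothetical helper: looks only at the "Average:" rows, finds the %CPU and
# Command columns by header name, and reports "dominant" when one process
# holds more than half of the listed CPU (an assumed cut-off).
cpu_shape() {
  awk '
    $1 != "Average:" { next }
    $2 == "UID" {
      for (i = 1; i <= NF; i++) {
        if ($i == "%CPU")    c = i
        if ($i == "Command") k = i
      }
      next
    }
    c && k {
      total += $c
      if ($c + 0 > max) { max = $c + 0; who = $k }
    }
    END {
      if (total == 0)            print "idle"
      else if (max / total > .5) print "dominant " who
      else                       print "crowd"
    }'
}

# Typical use:
#   pidstat 1 5 | cpu_shape
```

On the sample output above it prints "dominant feedsvc".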

Either way, the triage is over and the specialist investigation begins. The full CPU walk — busy versus waiting, process to thread to function to syscall, and the four endings every CPU incident resolves to — is what's eating my CPU?, and you arrive at its step 1 with the routing already done.

Step 5 — the memory path: free -h, read correctly

If si/so were moving, confirm the squeeze and size it.

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        29Gi       241Mi       1.1Gi       1.8Gi       412Mi
Swap:          2.0Gi        1.9Gi        88Mi
  available is the number that matters: 412Mi of headroom on a 31Gi box, swap nearly full

Resist the reflex to read the free column; it is nearly always small on a healthy machine, because Linux spends idle memory on the page cache and gives it back on demand. The honest number is available: the kernel's estimate of how much memory could be handed to a new allocation without swapping. Here it is 412Mi against 31Gi, with swap 95% occupied and — from vmstat — actively churning. That is a box past the edge: the working set no longer fits, every page the kernel reclaims is a page something needed, and the cost surfaces as latency everywhere at once, which is exactly why the complaint arrived as "slow" rather than as anything more specific. A second tell is buff/cache crushed down to 1.8Gi on a machine this size; the cache is the first thing reclaim eats, and a skinny cache means file reads that used to be free now hit the disk, so the memory shortage quietly becomes a disk problem too.
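For notes or alerts it can help to turn available into a single percentage. The sketch below reads the /proc/meminfo format, which is where free takes its figures from; MemAvailable is the field behind the available column. The function name mem_headroom is invented for illustration, and no threshold is implied.

```shell
# mem_headroom  (/proc/meminfo format on stdin)
# Prints MemAvailable as a percentage of MemTotal, one decimal place.
mem_headroom() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

# Typical use:
#   mem_headroom < /proc/meminfo
```

With figures matching the example (31Gi total, 412Mi available) it prints 1.3, that is, roughly 1.3% of memory left as headroom.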

From here the question is which process grew and why, whether it is a leak or a legitimate working-set change, and what the OOM killer will do about it if you do nothing. That investigation, including how to read RSS honestly and what the kill scoring means, is what's eating my memory? — go there with the free output in hand.

Step 6 — the disk path: iostat, then the writer's name

If wa was high and b nonzero, the storage layer is the queue. iostat -xz 1 shows per-device truth; the x gets the extended columns that matter and the z hides idle devices.

$ iostat -xz 1
Device     r/s     w/s     rkB/s     wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1   38.0   614.0    1520.0  311296.0     1.2    48.7    29.4   99.6
nvme1n1   12.0    22.0     488.0     310.0     0.4      0.9     0.1    3.8
  one device pinned at 99.6% with 29 requests queued and writes waiting 48ms each;
  its neighbour is idle — this is a single hot device, not a saturated controller

Three columns carry the verdict. %util is the share of time the device had at least one request in flight — utilisation, the U. A device at 99.6% for seconds at a time is busy every moment you looked. aqu-sz is the average queue depth — saturation, the S — and it is the more damning number: 29 requests waiting means arrivals outrun completions and every new request joins a line. And await (split into reads and writes on modern sysstat) is what the queue costs: each write here waits 48 milliseconds from issue to completion on a device whose hardware service time is a fraction of a millisecond. The latency is the queue, not the disk. One honest caveat before you quote %util in the incident channel: on SSDs and anything RAID-like, which serve many requests in parallel, 100% util does not mean the device is out of capacity — it means at least one request was always in flight. Queue depth and await are the numbers that prove actual suffering.
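Since queue depth is the number that proves suffering, a filter that lists only devices with a deep queue is a reasonable first cut. A sketch with assumptions: the name hot_devices and the default threshold of 4 are illustrative, and the queue column is looked up by name because older sysstat releases call it avgqu-sz rather than aqu-sz. The first iostat report covers the time since boot, so only later reports reflect the present.

```shell
# hot_devices [LIMIT]  (iostat -x output on stdin)
# Hypothetical helper: prints device, queue depth and %util for every row
# whose queue depth is at or above LIMIT (default 4, an assumed value).
hot_devices() {
  awk -v limit="${1:-4}" '
    $1 == "Device" || $1 == "Device:" {
      q = u = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "aqu-sz" || $i == "avgqu-sz") q = i
        if ($i == "%util") u = i
      }
      next
    }
    q && u && NF >= u && $q + 0 >= limit { print $1, $q, $u }'
}

# Typical use (read the rows from the later reports, not the first):
#   iostat -xz 1 3 | hot_devices
```

On the sample output above it prints "nvme0n1 29.4 99.6" and stays silent about the idle neighbour.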

A saturated device is half an answer; the other half is who is saturating it. Per-process I/O comes from the same tool that named the CPU consumer:

$ pidstat -d 1 5
Average:      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
Average:        0     48211      0.0  297410.2       0.0       2  backup-agent
Average:      998      2114    412.5    1184.7       6.1      64  postgres
  the writer writes 297 MB/s and barely waits; the database writes 1 MB/s and queues behind it

Read the iodelay column against the throughput columns and the power structure of the device is laid bare: the backup agent produces 99% of the writes and suffers almost none of the delay, while the database, doing a trickle of I/O, eats the queue the backup built. The victim and the culprit are different processes, which is the usual shape of a disk incident and the reason "which process is slow" and "which process caused it" are different questions.
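The culprit-versus-victim reading can be made mechanical: the heaviest writer on one line, the most delayed process on the other. A minimal sketch, assuming the pidstat -d Average rows shown above; the name io_roles and the output labels are invented for illustration, and a large iodelay only suggests a victim rather than proving one.

```shell
# io_roles  (pidstat -d output on stdin)
# Hypothetical helper: from the "Average:" rows, names the process with the
# highest kB_wr/s (likely culprit) and the one with the highest iodelay
# (likely victim). Columns are located by header name.
io_roles() {
  awk '
    $1 != "Average:" { next }
    $2 == "UID" {
      for (i = 1; i <= NF; i++) {
        if ($i == "kB_wr/s") w = i
        if ($i == "iodelay") d = i
        if ($i == "Command") k = i
      }
      next
    }
    w && d && k {
      if ($w + 0 > maxw) { maxw = $w + 0; writer = $k }
      if ($d + 0 > maxd) { maxd = $d + 0; victim = $k }
    }
    END {
      if (writer != "") print "writer: " writer
      if (victim != "") print "most delayed: " victim
    }'
}

# Typical use:
#   pidstat -d 1 5 | io_roles
```

When both lines name the same process, the heavy writer is mostly hurting itself; when they differ, as in the sample above, the incident has a culprit and a separate victim.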

The decision. One device hot, one process writing → you have a culprit; throttle it, reschedule it, or move its target. Device hot with no heavy writer → suspect swap traffic (back to memory) or a sick device (back to dmesg). All devices modest but await high → the storage is remote and the network path may own this after all.

Step 7 — the network and app path: ss -s, then off the box

If every local resource came back clean, the slowness is in flight or in logic. The first reading is the socket summary, which is to the network what uptime was to the CPU — one cheap line that says whether anything is obviously off.

$ ss -s
Total: 1412
TCP:   1389 (estab 212, closed 941, orphaned 3, timewait 902)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       12        9         3
TCP       448       430       18
  902 sockets in timewait on a service that should hold a few dozen long-lived
  connections: something is opening and closing a connection per request

You are pattern-matching, not measuring. A timewait population in the hundreds or thousands on a service that should pool its connections means churn — a connection per request, somewhere, paying a setup cost on every call. An orphaned count climbing means closed sockets still draining to a peer that has stopped reading. Established connections far above or below what the service's architecture predicts is a question all by itself. None of these is proof; each is a thread to pull, and the pulling — retransmits, drops, listen queue overflows, and the order in which to suspect them — is the subject of is it the network?, which picks up exactly where this summary leaves off.
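The churn pattern can be pulled out of the summary line with a few lines of awk. This is a sketch, not a measurement: the name socket_churn is invented, and the rule "more sockets in timewait than established" is an assumed heuristic for a service that is supposed to pool connections. What counts as normal depends on the service.

```shell
# socket_churn  (ss -s output on stdin)
# Hypothetical helper: reads the "TCP:" summary line and compares the
# timewait count with the established count (an assumed heuristic).
socket_churn() {
  awk '
    $1 == "TCP:" {
      for (i = 2; i < NF; i++) {
        if ($i == "(estab" || $i == "estab") e = $(i + 1) + 0
        if ($i == "timewait")               t = $(i + 1) + 0
      }
      verdict = (t > e) ? "churn" : "ok"
      print verdict ": " t " timewait vs " e " established"
    }'
}

# Typical use:
#   ss -s | socket_churn
```

On the sample output above it prints "churn: 902 timewait vs 212 established".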

The routing, on one screen

[Figure: the triage as a routing diagram.]
Spine (always run): "the box is slow" → uptime · dmesg | tail (loaded? trending? anything screaming?) → vmstat 1 5, the router (r · b · si/so · us sy id wa)
CPU lane (r ≥ cores). U: us+sy · S: r vs cores · E: throttle counters. pidstat 1 (who owns the cores?) → what's eating my CPU?, page 13
memory lane (si/so > 0). U: available · S: si/so · E: OOM kills in dmesg. free -h (read "available") → what's eating my memory?, page 14
disk lane (wa, b up). U: %util · S: aqu-sz, await · E: I/O errors in dmesg. iostat -xz 1 · pidstat -d (which device · which writer) → culprit named on the spot: throttle / move / reschedule
network lane (all quiet). U: bandwidth · S: queues · E: retrans, drops. ss -s (churn? orphans?) → is it the network? or the app itself
Nothing saturated anywhere? The box is fine; look upstream, downstream, or inside the app.
The triage as four USE-shaped lanes. The verdant spine is the part you always run; vmstat picks the lane; each lane ends at a specialist page or a named culprit.

Notice the structure: the spine is unconditional, the lanes are exclusive, and every lane checks the same three things for its resource — how busy, how queued, how broken. If you find yourself running commands that do not fit a lane, you have started theorising; go back to the last reading you trusted.

When nothing is saturated

The fourth ending deserves its own section because it is the one engineers handle worst. You ran the eight. Load is modest, the kernel is calm, the run queue is short, swap is still, the disks are loafing, the sockets look normal. The temptation is to distrust the readings and run them again, louder. Do not. A clean triage is a finding, and a valuable one: the box is fine, so the slowness lives in one of three places this checklist cannot see.

Upstream or downstream of the box, first. The service is slow because something it calls is slow — a database on another host, a third-party API, a DNS resolver taking its time. From this machine's point of view, waiting on a slow dependency is indistinguishable from idleness; the threads sit parked in network reads, consuming nothing. Your latency dashboards, broken down by dependency, answer this faster than any further command here can, and the per-connection forensics live in is it the network?

Inside the application, second. Software is very good at being slow without using any resource the kernel can see. Lock contention: fifty threads serialising politely through one mutex produce near-zero CPU and terrible latency. Pool exhaustion: a connection pool sized at ten under load that needs forty makes every request queue at the pool, invisible to vmstat because the queue is a data structure, not a run queue. A garbage collector pausing the world a hundred times a minute charges its cost as latency, not as sustained utilisation. The tells are application-level: thread dumps full of parked threads, pool metrics pinned at their maximum, GC logs with pause times that match the complaint. A quick look at process states in top & htop helps here — a service that is "slow" while every thread sleeps is confessing that it is waiting on something — and a syscall trace with strace will show threads blocked in futex or read calls, naming the wait directly.
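The "every thread sleeps" tell can be checked with a quick tally of thread states. A rough sketch, assuming procps ps and its one-letter state codes (R running, S interruptible sleep, D uninterruptible sleep); the name state_tally is invented, and a thread dump from the runtime itself will always say more than this does.

```shell
# state_tally  (one process-state letter per line on stdin)
# Counts running (R), sleeping (S) and uninterruptible (D) threads.
state_tally() {
  awk '
    { count[substr($1, 1, 1)]++ }
    END { printf "R=%d S=%d D=%d\n", count["R"], count["S"], count["D"] }'
}

# Typical use, assuming procps ps and a known PID (1234 is a placeholder):
#   ps -L -o state= -p 1234 | state_tally
```

A slow service whose tally is almost all S with little or no R fits the waiting-on-something picture described above.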

And occasionally: nowhere, because the complaint is wrong. The user's wifi, a stale cache on their side, a dashboard averaging two populations. Before you spend the afternoon, get one number that pins the slowness to a request ID or a timestamp you can find in your own telemetry. "Slow" that cannot produce a single slow request is weather, not a system problem.

A worked example, end to end

The pager fires at 02:13: p99 latency on api-04 tripled. Nothing deployed since Friday. SSH in and run the spine.

$ uptime; nproc
 02:14:51 up 64 days,  9:12,  1 user,  load average: 13.80, 11.21, 6.02
8
$ dmesg -T | tail
  …routine lines only, nothing screaming
$ vmstat 1 5
 r  b   swpd   free   buff  cache   si   so    bi     bo    in    cs us sy id wa st
 1  6 102400 822416 41200 9904188    0    0    88 301244  6120  9412  8  5 46 41  0
 2  7 102400 821980 41200 9904212    0    0   120 298332  6080  9388  9  6 44 41  0
  → load 13.8 but r ≈ 2: the load is not CPU demand. b at 6-7 and wa at 41:
    half a dozen tasks permanently blocked on disk, and 300 MB/s of writes. Disk lane.

Load of 13.8 on 8 cores looked like a CPU story for exactly as long as it took vmstat to print. The run queue is nearly empty; the load average is being inflated by the blocked column, those six or seven tasks in uninterruptible sleep, which Linux counts as load. No swap traffic, so memory is clear. wa at 41 with bo around 300,000 kB per second routes this to step 6.

$ iostat -xz 1
Device     r/s     w/s    rkB/s     wkB/s  r_await  w_await  aqu-sz  %util
nvme0n1   44.0   598.0   1760.0  306176.0     9.8     52.3    31.2   99.2
$ pidstat -d 1 5
Average:      UID       PID   kB_rd/s    kB_wr/s kB_ccwr/s iodelay  Command
Average:        0     61844      0.0   301228.4       0.0       3  backup-agent
Average:      998      2114    388.2     1022.6       4.8      71  postgres
  → one device at 99% with a 31-deep queue and 52ms writes; one process producing
    the writes. The 02:00 backup is flooding the same NVMe the database lives on.

Two commands past the spine and the story is whole: the nightly backup job started at 02:00 and is writing 300 MB/s to the same device that serves the database's WAL and reads, the device queue has grown to 31, every database fsync now waits 52 milliseconds behind backup blocks, and the API's p99, which is mostly database time, has tripled. The fix at 02:20 is operational: pause or throttle the backup (an ionice class, a bandwidth cap in the agent's config) and watch aqu-sz collapse and p99 follow within a minute. The fix at 10:00 is structural: backups write to a different device or a different window, and the alert that fired learns to annotate itself with iostat so the next person starts at step 6. Total time from page to mitigation: under ten minutes, and not one step required a guess. uptime said "saturated", vmstat said "disk, not CPU", iostat named the device, pidstat named the process, cron named the human.

The fast version. The eight, in order, each a few seconds: uptime (loaded? trending?) · dmesg | tail (anything screaming?) · vmstat 1 5 (the router: r, b, si/so, us/sy/id/wa) · then the lane it picks: pidstat 1 (CPU: who?) · free -h (memory: read available) · iostat -xz 1 (disk: %util, await, aqu-sz) · pidstat -d 1 (disk: who's writing?) · ss -s (network: churn, orphans). Sixty seconds, and you exit knowing which specialist page to open — or knowing the box is innocent, which is worth just as much.

What to write in the incident notes

The triage produces a small, specific artifact, and writing it down takes two minutes while the commands are still in scrollback. First, the route: which lane vmstat picked and why, in one line — "load 13.8 was b-driven, wa 41, routed to disk" tells the next responder more than a paragraph of impressions. Second, the raw readings: paste the actual uptime line, the two or three vmstat samples, the iostat row, the pidstat rows. Outputs can be re-read when the conclusion is challenged; adjectives cannot. Third, the culprit and the mechanism, named as specifically as the evidence allows: not "disk contention" but "backup-agent wrote 300 MB/s to nvme0n1 starting 02:00, queue depth 31, database fsyncs waited 52ms". Fourth, what you changed and when, to the minute, so the recovery visible in every dashboard has a label. Fifth, the prevention item with an owner: move the backup, cap its bandwidth, re-key the alert. And one habit worth stealing: if the triage came back clean, write that down too, with the readings. "Checked all four lanes at 02:15, box clean, escalating to the app team" saves the next responder from re-running your minute and saves you from the suspicion that you skipped it.

Further reading
