top & htop

You SSH into a box because something is slow, and the first thirty seconds decide whether the next hour is diagnosis or guesswork. top is the tool for those thirty seconds. It answers one question: what is this machine doing right now, and who is responsible? Most engineers run it, glance at the big numbers, and close it without reading any of them. This page fixes that. It covers the five keys worth knowing, a line-by-line read of the header everyone skips, three production incidents, where the numbers come from in /proc, and a drill that ends with you deliberately pinning a core and then letting it go.

The question it answers

Every performance investigation starts from the same place: a machine that is misbehaving and a person who does not yet know why. Before you can ask the good questions (which resource is saturated, which process is at fault, whether the problem is CPU, memory, disk, or none of those) you need a live picture of the whole box. That is the job of top. It samples the kernel's accounting once per refresh interval (every 3 seconds by default, adjustable with -d or the d key), sorts every process by how much CPU it is burning, and paints the result on your terminal in a loop. It is the first command worth typing on an unfamiliar machine, which is why it is the first page in this series.

htop is the same idea with better manners: colour, scrolling, mouse support, per-core meters drawn as bars, and a tree view that shows which process spawned which. Day to day, htop is the nicer place to live. But the two tools read the same kernel counters and answer the same question, and top ships with effectively every Linux system while htop often does not. Learn to read top cold and htop becomes a comfort, not a dependency. The reverse leaves you stranded the first time you land on a minimal container image at 3am.

It also helps to know what these tools are not. They are not profilers; they tell you that a process is burning CPU, not which function inside it is responsible — for that, the trail continues in what's eating my CPU? They are not historians; they show the state of the machine right now, and a spike that ended two minutes ago has already left the screen. And they are not the whole picture: memory pressure and disk traffic get one summary line each, and when those lines look suspicious the follow-up tools are free & vmstat. What top gives you is the triage view, the same role the first checks play in the USE method: utilisation and saturation for the whole box, with names attached.

The five keys that matter

Both tools are interactive, and the keyboard is where the value is. The default view is a CPU leaderboard; one keystroke turns it into a memory leaderboard, a per-core view, or a process tree. These are the keys that earn their place in your fingers.

Key / flag	What it does	When you reach for it
`top -o %MEM`	Starts top already sorted by resident memory	"What is eating the RAM" — the second most common question after CPU
`P` / `M` / `T`	Re-sorts the live view: by CPU, by memory, by cumulative CPU time (all shift+key)	Flipping between leaderboards mid-investigation; `T` finds the long-running grinder that is never on top of the instantaneous view
`1`	Expands the single %Cpu(s) summary into one line per core	Whenever the box "looks fine" but one thing is slow — averages hide pinned cores
`c`	Toggles full command lines in the COMMAND column	Ten identical `java` or `python` rows; the arguments tell them apart
`e` / `E`	Cycles memory units (KiB, MiB, GiB…) in the task list / the summary header	Reading `12782340` as 12.2 GiB without doing arithmetic under stress
htop: `F5`	Tree view — processes nested under their parents	Working out who spawned the thing that is misbehaving, and what dies with it if you kill the parent

Two habits worth forming early. First, sorting answers most questions before you ever read a number: sorted by CPU, the culprit of a CPU problem is on line one; press M and the culprit of a memory problem is on line one. Second, in htop the tree view changes what a kill means. A worker that keeps coming back from the dead usually has a supervisor respawning it, and F5 shows you the supervisor sitting one level up. The mechanics of actually stopping things, and the difference between asking and insisting, live in kill & signals.

One non-interactive flag. top -bn1 runs one iteration in batch mode and exits — that is how you put top output into a script, a log, or a ticket. Mind the caveat in the pitfalls section though: the CPU percentages in the very first iteration are averages since boot, not "right now."

Reading the header

Here is a realistic header from a 4-core web box that is having a bad afternoon. Most people's eyes slide straight past these five lines to the process list below. The header is the better half of the tool.

$ top
top - 14:32:07 up 41 days,  3:12,  2 users,  load average: 6.41, 5.87, 4.92
Tasks: 213 total,   2 running, 210 sleeping,   0 stopped,   1 zombie
%Cpu(s): 12.3 us,  4.1 sy,  0.0 ni, 71.2 id, 10.9 wa,  0.0 hi,  0.4 si,  1.1 st
MiB Mem :  15842.3 total,    412.7 free,   9216.4 used,   6213.2 buff/cache
MiB Swap:   2048.0 total,   2046.1 free,      1.9 used.   5904.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  41327 deploy    20   0   12.4g   2.1g  24512 S 186.7  13.6 412:11.07 java
    812 postgres  20   0  328940 121408  98244 D   0.7   0.7  88:14.92 postgres
   1290 root      20   0  142212  18044  11236 S   2.0   0.1   3:02.55 nginx

The load average, actually explained

Three numbers: averages over the last 1, 5, and 15 minutes. The folk definition — "how many processes wanted CPU" — is wrong on Linux in a way that matters. Linux counts two kinds of task into the load: tasks that are runnable (running on a CPU or queued waiting for one) and tasks in uninterruptible sleep, the D state, which almost always means blocked on disk or another piece of slow I/O. So the Linux load average is a demand number for the whole machine, not for the CPU alone. A load of 6 can mean six tasks fighting over the processors, or one task computing while five sit parked in D state waiting on a sick disk. Same number, opposite diagnoses, and the %Cpu(s) line below it is how you tell them apart.

The numbers only mean something relative to the core count. One runnable task per core is a machine working at capacity with no queue; more than that and tasks are waiting. On this 4-core box, a 1-minute load of 6.41 says that, on average, about 2.4 tasks' worth of demand (6.41 minus the 4 that the cores could serve) had to wait at any instant. The same 6.41 on a 32-core box is a quiet Tuesday. Check the core count with nproc before you let any load number alarm you. The three windows give you a slope as well as a level: 1-minute above 15-minute means the problem is arriving; 1-minute below 15-minute means it is leaving and you may be looking at the aftermath rather than the cause.

The same load number is healthy or alarming depending on where the core-count line sits. Read level against nproc, then read the slope across the three windows.
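As a rough sketch of that two-step read, the helpers below divide a load figure by a core count and compare the 1-minute window against the 15-minute one. The names load_per_core and load_trend are made up for this page, and the result is a first-pass signal, not a verdict.

```shell
# Hypothetical helpers; the names are illustrative, not part of any tool.

# load_per_core LOAD CORES -> load divided by core count, two decimals
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# load_trend ONE_MIN FIFTEEN_MIN -> rising, falling, or flat
load_trend() {
  awk -v one="$1" -v fifteen="$2" 'BEGIN {
    if (one > fifteen)      { print "rising" }
    else if (one < fifteen) { print "falling" }
    else                    { print "flat" }
  }'
}
```

On a live box, something like `read -r one five fifteen rest < /proc/loadavg` followed by `load_per_core "$one" "$(nproc)"` and `load_trend "$one" "$fifteen"` feeds them real numbers. With the header above, 6.41 on 4 cores comes out as 1.60 per core, and 6.41 against 4.92 reads as rising.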

The %Cpu(s) line

Eight numbers that say where every CPU cycle went during the last refresh interval. us is user time: your programs running their own code. sy is system time: the kernel working on behalf of those programs — syscalls, network stack, filesystem work. A high sy relative to us means processes are asking the kernel to do something over and over, which is its own clue. ni is user time from processes running at a lowered priority, and id is genuine idle. hi and si are hardware and software interrupt handling, normally near zero and interesting when they are not.

The two that decide incidents are wa and st. wa is iowait, and it is widely misread: it does not mean the CPU is busy doing I/O. It means the CPU is idle, and at least one task on it is blocked waiting for I/O to finish. It is idle time with an asterisk — the processor has nothing to do because the disk has not answered yet. That is why "high load, idle CPU" is not a paradox; the waiting tasks count toward load while contributing nothing but wa to this line. st is steal time, and it only exists on virtual machines: the slice of time your VM had a task ready to run but the hypervisor gave the physical CPU to someone else. Inside the VM there is nothing to fix; the contention is on the host, between you and tenants you cannot see.
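If you want wa or st in a script rather than on a screen, one way is to pull a single labelled value out of the summary line. The sketch below assumes the "value label," layout shown in the header on this page; older or localised builds of top may format the line differently, and cpu_field is a made-up name.

```shell
# cpu_field LABEL: read top output on stdin and print the value for one
# label (us, sy, ni, id, wa, hi, si, st) from each summary CPU line.
# Assumes the "12.3 us, 4.1 sy, ..." layout shown on this page.
cpu_field() {
  awk -v want="$1" '/^%Cpu/ {
    sub(/^[^:]*:/, "")                 # drop the "%Cpu(s):" prefix
    n = split($0, pairs, ",")
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, " ")         # kv[1] = value, kv[2] = label
      if (kv[2] == want) print kv[1]
    }
  }'
}
```

A plausible use is `top -bn2 | cpu_field wa | tail -1`, which keeps the value from the second frame; the pitfalls section explains why the first one is not worth reading.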

VIRT, RES, SHR — the decoder nobody teaches

Three memory columns per process, and the biggest one is the least meaningful. VIRT is virtual size: the total address space the process has mapped. It counts heap the allocator reserved but never touched, files mapped into memory whether or not any page was read, anything swapped out, and shared libraries over again for every process that maps them. It is a measure of promises, not of RAM. A JVM or a Go service showing tens of gigabytes of VIRT on a 16 GB machine is normal and fine.

RES is the resident set: the pages actually sitting in physical RAM right now. This is the column that means what people think VIRT means, and it is the number behind %MEM. When you are hunting a memory hog, sort by %MEM and read RES. SHR is the slice of RES that is shared with other processes — mostly shared libraries and explicitly shared memory segments. It matters when you are tempted to multiply: ten workers each showing 200 MB RES with 150 MB SHR are not using 2 GB, because most of that SHR is the same physical pages counted ten times. A process's private footprint is closer to RES minus SHR than to RES, and nowhere near VIRT.

The three memory columns as containers. VIRT promises, RES occupies, SHR is the part of RES that other processes occupy too.
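The "RES minus SHR" estimate can be worked out from the kernel's own file. This is a sketch under two assumptions: that the kernel exposes RssFile and RssShmem in /proc/PID/status (Linux 4.5 or later), and that their sum is a fair stand-in for the SHR column. private_kib is a hypothetical helper name.

```shell
# private_kib: read /proc/PID/status text on stdin and print
# VmRSS - (RssFile + RssShmem) in KiB, a rough private footprint.
# Assumes a kernel that exposes RssFile and RssShmem (Linux 4.5 or later).
private_kib() {
  awk '
    $1 == "VmRSS:"    { rss   = $2 }
    $1 == "RssFile:"  { file  = $2 }
    $1 == "RssShmem:" { shmem = $2 }
    END { print rss - file - shmem }
  '
}
```

Try it on your own shell with `private_kib < /proc/$$/status`. For the worker in the example above (200 MB resident, 150 MB shared) the answer is about 50 MB, which is the number worth multiplying by ten.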

One more column in the sample output deserves a word: the S column, process state. R is running or runnable, S is ordinary sleep, Z is a zombie (exited and waiting for its parent to collect the exit status; one or two are cosmetic, hundreds mean a buggy parent), and D is the uninterruptible sleep from the load average discussion. The postgres row above is in D state. Hold that thought.

Three production scenarios

Load is 14 but the CPU is idle

An alert fires on load average. You log in and the header makes no sense at first read: load 14 on a 4-core box, but id says the CPUs are 78% idle. The reconciling number is sitting right there: wa at 19. The machine is not short of CPU; it is full of tasks in D state, blocked on storage, each one counting toward load while the processors twiddle. The disk is the bottleneck, the load average is just the queue forming behind it.

$ top -bn1 | head -3
top - 02:14:31 up 12 days, 22:40,  1 user,  load average: 14.22, 11.04, 6.31
Tasks: 188 total,   1 running, 187 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.8 us,  1.1 sy,  0.0 ni, 78.0 id, 19.1 wa,  0.0 hi,  0.0 si,  0.0 st

$ ps -eo state,pid,comm | awk '$1=="D"'
D    812 postgres
D    815 postgres
D    819 postgres
D   2204 kworker/u8:3

The ps one-liner lists the D-state tasks by name, and a cluster of them from the same service points the finger. From here the investigation belongs to the disk: is it saturated, dying, or an NFS mount that stopped answering? The tools for that next step, including watching the b column count blocked tasks over time, are on the free & vmstat page. The lesson that survives the incident: a load alert is not a CPU alert. Read wa before you assume.
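When the D-state list is long, a count per command makes the cluster obvious. The sketch below is one way to do that; dstate_by_command is a made-up name, and it only looks at the first word of each command name, which is a simplification.

```shell
# dstate_by_command: read "STATE COMMAND" lines on stdin and print a
# count per command for tasks in uninterruptible sleep, busiest first.
# Only the first word of the command is used, which is a simplification.
dstate_by_command() {
  awk '$1 ~ /^D/ { count[$2]++ }
       END { for (cmd in count) print count[cmd], cmd }' | sort -rn
}
```

Feed it with `ps -eo state=,comm= | dstate_by_command`. Run it a few times in a row: D is often a transient state, and what you care about is the command that is there every time you look.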

The cloud VM that lost a third of its CPU

A service on a cloud VM gets slower over a week with no deploy and no traffic change. CPU graphs from inside the box show usage well below 100%, yet latency keeps creeping. The header tells the story in one number most dashboards never plot:

$ top -bn1 | grep Cpu
%Cpu(s): 41.2 us,  8.3 sy,  0.0 ni, 18.1 id,  0.6 wa,  0.0 hi,  0.9 si, 30.9 st

Thirty-one percent steal. Nearly a third of the time this VM had work ready, the hypervisor handed the physical core to another tenant. Two usual causes: a noisy neighbour on an oversubscribed host, or a burstable instance type that has spent its CPU credits and is being throttled by design — check which class of instance you are on before blaming anyone. Either way, no amount of tuning inside the guest gets those cycles back. The fixes are operational: resize to a non-burstable type, redeploy so the scheduler places you on a different host, or pay for dedicated capacity. A few percent of transient st is life on shared hardware; sustained double digits is a capacity problem wearing a performance costume, and the cheapest thing you can do is stop profiling your own code and look at this line first.

One core pinned while the box "looks fine"

An 8-core machine shows 13% total CPU, every dashboard is green, and yet one workload is mysteriously slow. The summary line is an average over all cores, and an average is exactly the right tool for hiding one bad core among seven idle ones. Press 1:

$ top   (then press 1)
%Cpu0  :  2.0 us,  1.0 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  1.7 us,  0.7 sy,  0.0 ni, 97.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 99.0 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  3.0 us,  1.3 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
...cores 4-7 similar...

Core 2 is saturated and has been for as long as the workload has been slow. The usual suspects: a single-threaded program that simply cannot go faster than one core, a hot thread inside a multi-threaded service (a garbage collector, a lone event loop, a compression job), or interrupt handling pinned to one core by IRQ affinity so that all network processing lands in one place. Identifying which thread it is and what it is executing is the subject of what's eating my CPU?, and the reason one task stays glued to one core rather than spreading is the scheduler's affinity logic, covered in scheduling. The habit to take away is cheap: any time a machine "looks fine" but is not, press 1 before you trust the average.
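The same check can be done without top at all, from two saved copies of /proc/stat. The sketch below picks out the busiest core between two snapshots; busiest_core is a hypothetical name, and treating both idle and iowait as "not busy" is a choice made for this example.

```shell
# busiest_core BEFORE AFTER: compare two saved copies of /proc/stat
# (older first) and print the busiest core with its busy percentage.
# idle and iowait both count as "not busy" here; that is a choice of this sketch.
busiest_core() {
  awk '
    function total() { return $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 }
    $1 !~ /^cpu[0-9]+$/ { next }          # skip the aggregate and non-CPU lines
    FNR == NR { t0[$1] = total(); i0[$1] = $5 + $6; next }
    {
      dt = total() - t0[$1]
      if (dt <= 0) next
      busy = 100 * (1 - ($5 + $6 - i0[$1]) / dt)
      if (name == "" || busy > best) { best = busy; name = $1 }
    }
    END { if (name != "") printf "%s %.1f\n", name, best }
  ' "$1" "$2"
}
```

For example: `cat /proc/stat > /tmp/stat.a; sleep 1; cat /proc/stat > /tmp/stat.b; busiest_core /tmp/stat.a /tmp/stat.b`. A result near 100 on a box whose overall CPU looks low is the pinned-core signature described above.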

Where the numbers come from

Neither tool has privileged access to anything. Everything on the screen is read from /proc, the kernel's window onto its own state, and you can read the same files with cat. The load average is the file /proc/loadavg: three damped moving averages the kernel maintains as part of its regular timekeeping, decaying old samples exponentially so the 1-minute number reacts fast and the 15-minute number smooths the noise. What gets counted into them is the part worth remembering — tasks runnable plus tasks in uninterruptible sleep, which is the design decision that makes Linux load a whole-machine demand signal rather than a CPU queue length.

$ cat /proc/loadavg
6.41 5.87 4.92 2/213 41330
          the three averages, then runnable/total tasks, then the last PID handed out

$ head -2 /proc/stat
cpu  84321907 21340 19833502 933012890 8231201 0 412390 901230 0 0
cpu0 21080476  5335  4958375 233253222 2057800 0 103097 225307 0 0
      user    nice   system   idle      iowait hi  si    steal  ...

The %Cpu(s) line comes from /proc/stat, which holds one counter per CPU per category (user, nice, system, idle, iowait, steal, and the rest), each ticking up forever since boot. The counters are cumulative, so a single read is meaningless; top reads the file, sleeps for the refresh interval, reads it again, and the percentages you see are the deltas. Per-process numbers work the same way from /proc/PID/stat (CPU time consumed) and /proc/PID/status (memory: VIRT and RES under their kernel names VmSize and VmRSS, and SHR as the sum of RssFile and RssShmem). top and htop are, to a first approximation, loops that read these files, subtract, divide by the interval, and sort.

Knowing this buys you two things. During a bad incident on a stripped-down box with no tools installed, cat /proc/loadavg and two reads of /proc/stat get you the header by hand. And the numbers stop being oracle pronouncements: a percentage in top is a sampled difference between two counters, subject to sampling error and aliasing like any other measurement. The full tour of the filesystem behind all of this is on the /proc page, and if you want to watch run queues form and tasks migrate between cores instead of reading about it, the scheduler simulator lets you generate the load and see the queueing happen.
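Here is a minimal sketch of "the header by hand": take two aggregate cpu lines from /proc/stat and turn the differences into percentages. It is in the spirit of what top does, not a copy of its code; cpu_delta is a hypothetical name, and hi, si, and ni are left out of the printed line for brevity.

```shell
# cpu_delta OLD NEW: take two aggregate "cpu ..." lines from /proc/stat
# and print the share of elapsed ticks for the main categories.
# Columns 2-9 are user nice system idle iowait irq softirq steal;
# guest columns are skipped because the kernel already folds them into user.
cpu_delta() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) old[i] = $i; next }
    {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - old[i]; total += d[i] }
      if (total <= 0) { print "no ticks elapsed"; exit }
      printf "us %.1f sy %.1f id %.1f wa %.1f st %.1f\n",
             100 * d[2] / total, 100 * d[4] / total, 100 * d[5] / total,
             100 * d[6] / total, 100 * d[9] / total
    }'
}
```

On a live machine: `a=$(head -1 /proc/stat); sleep 1; b=$(head -1 /proc/stat); cpu_delta "$a" "$b"`. A shorter sleep gives a noisier answer, which is the sampling error mentioned above made visible.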

Pitfalls

%CPU above 100 is not a bug. In top's default mode (Irix mode), a process's %CPU is measured against a single core, so a process running four busy threads on four cores shows 400%. The java row in the header example reads 186.7% for exactly this reason. Pressing shift+I toggles Solaris mode, which divides by the core count so 100% means the whole machine. Neither is wrong; you just need to know which one you are reading before you quote a number in an incident channel. htop uses the per-core convention too.
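The conversion between the two conventions is a single division, sketched here so the arithmetic is unambiguous when you quote a number. irix_to_solaris is a made-up helper name.

```shell
# irix_to_solaris PCT CORES: rescale a per-core (Irix mode) %CPU figure
# so that 100 means every core on the machine is busy (Solaris mode).
irix_to_solaris() {
  awk -v pct="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", pct / cores }'
}
```

Four busy threads on a 4-core box, `irix_to_solaris 400 4`, come out as 100.0: the whole machine. The java row at 186.7 on the same box works out to a little under half of it.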

The VIRT panic. Someone sorts by VIRT, sees a 40 GB process on a 16 GB machine, and declares a leak. VIRT is address space, not memory, and modern runtimes reserve it wholesale: the JVM maps its maximum heap up front, Go reserves a large arena, anything using mmap on big files counts them all. The kernel hands out address space optimistically and only commits physical pages on first touch. Memory pressure is real when RES is large and growing, when swap usage climbs, or when available memory shrinks toward zero — and those last two live on the free & vmstat page. VIRT alone has never been an emergency.

htop is not always there. Minimal server images, containers, and rescue environments routinely ship without it, and during an incident is the wrong moment to discover that your fingers only know F5 and mouse clicks. Every skill on this page was written against plain top first for that reason. Practice the vanilla keys until the fancy tool is a luxury rather than a requirement.

The first batch sample lies. top -bn1 prints percentages computed from the since-boot counters, because there is no previous sample to diff against. On a box that has been up for 41 days, that first frame is a 41-day average, which is useless for "right now." Use top -bn2 and read the second frame; in the interactive tool, trust the display only from the second refresh onward, for the same reason.
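Reading the second frame by eye is easy; doing it in a script needs a filter. A small sketch, assuming each frame begins with a line starting "top - " as in the examples on this page (last_frame is a hypothetical name):

```shell
# last_frame: read multi-iteration `top -b` output on stdin and keep only
# the final frame. Frames are assumed to start with a line beginning "top - ".
last_frame() {
  awk '/^top - / { frame = "" }
       { frame = frame $0 "\n" }
       END { printf "%s", frame }'
}
```

A typical use would be `top -bn2 -d 1 | last_frame | head -12`, which costs one extra second and gives you a header you can paste into a ticket without an asterisk.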

Watching the watcher. top itself costs a little CPU, more with fast refresh intervals on busy machines, and it will happily appear in its own leaderboard. If a screenshot of top shows top near the top, that is not the smoking gun it appears to be.

A drill you can run right now

Everything below is safe on any Linux machine, including a shared one. The only thing it creates is one deliberately busy process that you will kill at the end, and the only thing that process does is write the letter y into a black hole.

Step 1 — establish the baseline. Run nproc and remember the number, then open top and read the header against this page: load average relative to core count, then the %Cpu(s) line left to right, then the Mem line. Find the largest process by CPU, press M to re-sort by memory, press P to flip back. Press c and watch the COMMAND column grow arguments. Press 1 and count the cores you saw with nproc.

Step 2 — pin a core on purpose. In a second terminal, start the noisiest harmless process Unix offers, then watch it land in top:

$ yes > /dev/null &
[1] 50612
$ top   (press 1, then P)
%Cpu3  : 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  50612 nilesh    20   0    8124    980    876 R 99.7   0.0   0:41.32 yes

yes prints y forever; redirected to /dev/null, it becomes a pure CPU burner. With the per-core view open, watch one core sit at or near 100% busy while the others idle: the pinned-core scenario, manufactured. (How the busy time splits between us and sy can differ from the sample above depending on kernel and coreutils version, because yes spends its life issuing write calls.) Notice the process state is R and its TIME+ climbs in real time. If you wait a minute or two and re-run cat /proc/loadavg, you can watch the 1-minute load drift up toward 1.0 on an otherwise idle box while the 15-minute number barely moves: the damped averages reacting at their different speeds. Sometimes the scheduler migrates the burner between cores mid-watch; that wandering is load balancing happening in front of you, and the scheduler simulator shows the same decision-making slowed down.

Step 3 — clean up, two ways. Kill it from the shell that started it with kill %1, or do it from inside the tool: in top press k, give the PID, and accept the default signal; in htop select the row and press F9. Run jobs to confirm nothing is left. What those signals actually are, and when the default is the wrong one, is the subject of kill & signals.
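Steps 2 and 3 can also be scripted end to end, which is handy if you want to repeat the drill while watching top in another terminal. This is a sketch, not a replacement for doing it by hand: burn_and_release is a made-up name, and it reads the process state from /proc rather than from top.

```shell
# burn_and_release SECONDS: start one `yes` burner, show its state from
# /proc after SECONDS, then terminate it and confirm it is gone.
# The function name is made up for this sketch.
burn_and_release() {
  yes > /dev/null &
  burner=$!
  sleep "$1"
  grep '^State:' "/proc/$burner/status"   # expect R while it is burning
  kill "$burner"                          # default signal, SIGTERM
  wait "$burner" 2>/dev/null || true      # reap it; exit status is not needed
  if kill -0 "$burner" 2>/dev/null; then
    echo "still running"
  else
    echo "released"
  fi
}
```

Run `burn_and_release 30` in one terminal and keep top open with the per-core view in the other. The core should pin for thirty seconds and then drop back, and the last line printed should be released.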

Step 4 — if htop is installed, take the tour. Open htop, press F5, and find your shell: terminal emulator or sshd at the top, your shell under it, htop itself as a child. Run the yes trick again and watch it appear in the tree under your shell, then kill it from the tree. Parentage is the thing top makes you reconstruct by hand and htop just shows you.

If you remember one line. Load average counts runnable plus uninterruptible tasks, so read it against nproc and check wa before blaming the CPU. Press 1 when the box looks fine but is not. And judge memory by RES, never by VIRT.
