free & vmstat

Someone pastes a graph in the incident channel: free memory has been falling for days and is nearly at zero. Is the box about to die? Almost certainly not — but it might be, and the difference between "the kernel is caching files, as designed" and "this machine is swapping itself to death" is the difference between closing the ticket and paging the on-call. free tells you where the memory went; vmstat tells you whether the machine is suffering for it. This page covers the handful of invocations worth knowing, decodes the columns that actually matter, walks three production scenarios, and ends with a drill you can run on any machine without breaking anything.

The question it answers

There are two memory questions on a Linux box and they get conflated all the time. The first is an inventory question: where has the memory gone? The second is a behaviour question: is anything suffering because of it? free answers the first with a one-shot snapshot of what the kernel is doing with every byte of RAM. vmstat answers the second by sampling the machine over time and showing you whether memory pressure is translating into work: pages moving to and from swap, processes blocked on I/O, CPU time burned waiting. You need both, because each one is misleading without the other.

The conflation matters because Linux deliberately runs with very little free memory. Any RAM that nothing has claimed gets used to cache file contents, on the theory that an unused byte of RAM is a wasted byte of RAM. So on a healthy machine that has been up for a while, free memory trends toward a small number and stays there, and the graph that looks like a slow leak is usually just the page cache filling up. The kernel will hand that cached memory back the moment a process asks for it. A box with 2% free memory can be perfectly fine; a box with 20% free can be minutes from the OOM killer. The free-memory number alone cannot tell you which one you have.

What can tell you is movement. A machine that is genuinely short of memory does not sit still: it starts evicting things it will need again, writing anonymous pages out to swap and reading them back in, over and over. That churn is called thrashing, and it has a precise signature in vmstat's si and so columns. The mental model for this page is therefore a two-step check. Step one, free -h: is the available number (not the free number) actually low? Step two, vmstat 1: is the machine actively swapping right now? Low available plus sustained swap traffic is a real memory problem. Everything else is probably the cache doing its job. The tools that come later in an investigation — top & htop to rank processes, the full hunt in what's eating my memory? — start from the verdict these two give you.

The flags that matter

Both tools are small, and the useful surface is smaller still. Here is the working set.

Invocation	What it shows	When you reach for it
`free -h`	One snapshot of memory and swap, in human units	First question on any box: where is the memory?
`free -w`	Same, but buffers and cache as separate columns	When you care which kind of cache is large (rare)
`free -h -s 2`	The snapshot repeated every 2 seconds	Watching a number move while you provoke it
`vmstat 1`	One line of system-wide counters per second, forever	The behaviour question: is it thrashing right now?
`vmstat 1 10`	Same, but stops after 10 samples	Pasting evidence into a ticket
`vmstat -s`	Every counter since boot, one per line, labelled	Totals and history: how much swapping has ever happened

The interval form is the one to internalise. A bare vmstat with no arguments prints a single line, and that line is close to useless for the thrashing question, because most of its columns are averages since boot. A machine that swapped heavily during one bad hour three weeks ago will show nonzero swap activity in that first line forever after, and a machine that started thrashing two minutes ago will barely move it. The same first-line problem applies to vmstat 1 itself: the first row of output is the since-boot average, and only the second row onward shows what happened during the last interval. Throw the first line away, every time. It is the single most common misreading of this tool, and it earns a fuller treatment in the pitfalls below.
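Discarding the first row can be automated when output is headed for a ticket. A minimal sketch, assuming the default vmstat layout in which every data row starts with a number (the r column); drop_since_boot is a hypothetical helper name, not something procps ships.

```shell
# drop_since_boot: hypothetical helper, not part of procps.
# Passes header lines through, drops the first data row (the since-boot
# average), and keeps every interval row after it.
drop_since_boot() {
    awk '$1 ~ /^[0-9]+$/ && ++seen == 1 { next } { print }'
}
```

Used as vmstat 1 6 | drop_since_boot, this leaves the headers and five interval rows. It is meant for the finite-count form; piping an endless vmstat 1 through awk may delay output because of pipe buffering.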

vmstat -s is the complement: it is supposed to be since-boot. When you want to know whether a machine has ever swapped at all, or roughly how much, the pages swapped in / pages swapped out lines give you the lifetime totals without any sampling. Both tools read their numbers from /proc/meminfo and /proc/vmstat; there is no agent and no daemon, just formatted kernel counters, which is why they cost almost nothing to run and why they are safe to point at a struggling box. More on the /proc side at /proc.
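The same lifetime totals can be read from the kernel directly. A small sketch, assuming the pswpin and pswpout counters are present in /proc/vmstat (they are on typical kernels built with VM event counters); swap_totals is a made-up function name, and the values are in pages, not KiB.

```shell
# swap_totals: hypothetical helper. Prints the since-boot swap-in and
# swap-out page counts from a /proc/vmstat-style file (default: /proc/vmstat).
swap_totals() {
    local file="${1:-/proc/vmstat}" key value
    while read -r key value; do
        case $key in
            pswpin)  echo "pages swapped in:  $value" ;;
            pswpout) echo "pages swapped out: $value" ;;
        esac
    done <"$file"
}
```

Calling swap_totals with no argument reads the live counters; both lines at zero means the machine has not swapped since boot.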

Reading free: why "free" is a lie

Here is free -h on a reasonably busy 32 GB application server. The row to read is Mem; the column to read is not the one your eye goes to first.

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        12Gi       1.2Gi       340Mi        18Gi        17Gi
Swap:          8.0Gi       1.1Gi       6.9Gi

Read it column by column. total is physical RAM minus what the kernel reserved for itself at boot. used is memory held by processes and the kernel that is not cache: heaps, stacks, page tables, slab. free is memory that nobody is using for anything at all, not even caching. shared is mostly tmpfs: files living in RAM-backed filesystems like /dev/shm, plus shared memory segments. buff/cache is the page cache and friends: file contents the kernel keeps in otherwise-idle RAM so the next read does not touch the disk. available is the kernel's own estimate of how much memory a new workload could claim before the system starts swapping.

The trap is the gap between free and available. On this box, free is 1.2 Gi, which looks alarming, and available is 17 Gi, which is the truth. The 18 Gi of buff/cache is not spoken for; it is opportunistic. Most of it is clean page cache — copies of file data that also exist on disk — and the kernel can drop those pages instantly, without writing anything anywhere, the moment a process needs the space. The free column counts only the RAM the kernel has not found any use for yet, which on a long-running machine is deliberately almost nothing. Treating that column as headroom is the classic mistake, and old habits make it worse: for years the folk remedy was the "-/+ buffers/cache" arithmetic from older versions of free. The available column made that arithmetic obsolete. It comes straight from MemAvailable in /proc/meminfo, where the kernel does the honest version of the calculation for you.

Honest, but still an estimate, and it is worth knowing what is inside it. Available is roughly free memory plus the part of the page cache and slab that is cheap to reclaim, minus a watermark the kernel insists on keeping in reserve. It is not free plus all of buff/cache, because not all of buff/cache can be dropped. Dirty pages must be written back first. Pages belonging to files that are mapped and hot will just get faulted straight back in. tmpfs lives inside the cache accounting but is real data with no disk copy behind it, so it cannot be reclaimed at all (it can only be swapped). That is why available on this box is 17 Gi and not 1.2 + 18 = 19 Gi. The diagram below is the picture worth keeping.
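The gap between "free plus all of buff/cache" and available can be measured rather than taken on faith. A rough sketch, assuming buff/cache corresponds to Buffers + Cached + SReclaimable in /proc/meminfo (which is how recent procps versions appear to compute the column); meminfo_kib and headroom_report are illustrative names only.

```shell
# meminfo_kib FIELD [FILE]: print one field (in KiB) from a meminfo-style file.
meminfo_kib() {
    awk -v want="$1:" '$1 == want { print $2; found = 1; exit } END { exit !found }' \
        "${2:-/proc/meminfo}"
}

# headroom_report [FILE]: compare the naive headroom (free + buff/cache)
# with the kernel's own MemAvailable estimate.
headroom_report() {
    local file="${1:-/proc/meminfo}"
    local mem_free buffers cached sreclaim avail naive
    mem_free=$(meminfo_kib MemFree "$file") || return 1
    buffers=$(meminfo_kib Buffers "$file") || return 1
    cached=$(meminfo_kib Cached "$file") || return 1
    sreclaim=$(meminfo_kib SReclaimable "$file") || return 1
    avail=$(meminfo_kib MemAvailable "$file") || return 1
    naive=$(( $mem_free + $buffers + $cached + $sreclaim ))
    echo "naive_kib=$naive"
    echo "available_kib=$avail"
    echo "overestimate_kib=$(( $naive - $avail ))"
}
```

The overestimate line is the part of the cache the kernel does not expect to get back cheaply: dirty pages, tmpfs, and the reserve watermark.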

The memory bar behind free -h. The buff/cache segment belongs to the kernel only until someone needs it; available is the honest headroom number.

The Swap row reads the same way as Mem but answers a different question, and the distinction is one of this page's main points: swap used is a stock, not a flow. This box has 1.1 Gi sitting in swap, and by itself that means nothing about the present. Those pages may have been pushed out during a deploy last month and never touched since, which is swap doing exactly what it is for: holding cold anonymous pages so warm ones can have the RAM. Whether swap is being used right now is a question free cannot answer. For that you need the flow, and the flow lives in vmstat.

Reading vmstat: the columns that matter

vmstat 1 prints a header and then one line per second until you stop it. Here is the same 32 GB box on a normal afternoon. Remember: line one is the since-boot average, so the real data starts on line two.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b   swpd    free   buff   cache   si   so    bi    bo   in    cs us sy id wa st
 2  0 1153434 1264640 91432 18874368   1    1    24    31  210   480 9  3 87  1  0  <- since boot: ignore
 1  0 1153434 1259520 91432 18874368   0    0     0   116 4210  7110 12  4 83  1  0
 3  0 1153434 1248212 91432 18874604   0    0    12   204 4677  8021 14  5 80  1  0
 1  0 1153434 1251020 91432 18874604   0    0     0    88 3980  6850 11  4 84  1  0

Seventeen columns, but only a handful carry the diagnosis. Here is the decoder, grouped the way the header groups them.

Column	What it counts	How to read it
`r`	Processes runnable: running or waiting for a CPU	Persistently above the core count means CPU contention
`b`	Processes in uninterruptible sleep (D state), almost always blocked on I/O	Persistently nonzero means something is stuck waiting on disk or NFS
`swpd`	Total swap in use (a stock)	History, not pressure — see the pitfalls
`si` / `so`	KiB/s swapped in from disk / out to disk (a flow)	Both sustained nonzero is the thrash signal
`bi` / `bo`	Blocks/s read from / written to block devices	General disk traffic; spikes are normal, walls are not
`us` / `sy`	CPU % in user code / in the kernel	High sy with high si/so is the kernel fighting for memory
`id` / `wa`	CPU % idle / idle-but-waiting-for-I/O	High wa means the CPU has work it cannot start until the disk answers

The r and b pair deserves a sentence more, because they look alike and mean opposite things. r is appetite for CPU: those processes have everything they need except a core to run on. b is the D-state count, processes the kernel will not even let a signal interrupt because they are mid-flight inside an I/O operation. A loaded-but-healthy machine has r bouncing around the core count and b at zero. A thrashing machine often inverts that: r modest, b climbing, because everyone is queued behind the disk that swap lives on. (These are also the two numbers that feed load average, which is why a thrashing box shows a scary load with a mostly idle CPU.)

si and so are the heart of the tool. so alone, in bursts, is unremarkable: the kernel proactively pushing cold pages out to make room is housekeeping. si alone, briefly, is also fine: something touched a page that was swapped out long ago. The pathological pattern is both columns nonzero, sample after sample. That means pages are being evicted and then needed again within seconds, the working set genuinely does not fit in RAM, and every major page fault costs a disk round-trip. You will see wa rise and id fall in the same rows, because the CPU spends its time waiting for swap I/O instead of doing work. One screen of vmstat 1 showing that pattern is sufficient evidence of real memory pressure; no further philosophy required.
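The "both nonzero, sample after sample" rule is mechanical enough to script. A sketch under stated assumptions: the default vmstat column order (si is field 7, so is field 8), and a hypothetical function name, count_thrash_samples. It skips headers and the since-boot row, then counts the interval rows where both columns are above zero.

```shell
# count_thrash_samples: hypothetical helper. Reads vmstat output on stdin.
# Exit status 0 means every interval sample had si > 0 and so > 0.
count_thrash_samples() {
    awk '
        $1 !~ /^[0-9]+$/ { next }                  # header or blank line
        !seen_first      { seen_first = 1; next }  # since-boot row
        { samples++; if ($7 > 0 && $8 > 0) thrash++ }
        END {
            printf "thrash_samples=%d total_samples=%d\n", thrash, samples
            exit !(samples > 0 && thrash == samples)
        }
    '
}
```

Run as vmstat 1 6 | count_thrash_samples. Requiring every sample to qualify is a deliberately strict choice for this sketch; a looser threshold may suit noisy machines better.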

bi and bo are the context line. They count all block I/O, not just swap, so a backup job or a database checkpoint will light up bo with no memory story at all. Their value here is correlation: when si/so and bi/bo surge together while cache shrinks, you are watching the eviction machinery work in real time.

Three production scenarios

"The box has no free memory!"

A monitoring alert fires on free memory below 5%, or a teammate runs free, sees a tiny number in the free column, and declares an emergency. The two-step check settles it in thirty seconds:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi        9.8Gi       412Mi       180Mi        21Gi        20Gi
Swap:          8.0Gi          0B       8.0Gi
$ vmstat 1 5
 r  b   swpd   free   buff   cache   si   so   bi    bo   in    cs us sy id wa st
 1  0      0 421888 102400 21893120   0    0   18    22  180   420  7  2 90  1  0  <- since boot: ignore
 0  0      0 419840 102400 21893120   0    0    0    64 2100  3900  5  2 92  1  0
 1  0      0 422912 102400 21893188   0    0    8    40 2240  4100  6  2 91  1  0

Available is 20 Gi, swap is untouched, si and so are flat zero, wa is 1. The 21 Gi in buff/cache is file data the kernel is keeping warm — this machine probably serves a lot of reads from disk, and the cache is why those reads are fast. There is no incident here; there is an alert that was written against the wrong column. The durable fix is to re-point the monitor at MemAvailable (or the available column) instead of free, because an alert on free memory on a Linux box is an alert on whether the machine has been up long enough for the cache to fill.
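What an alert on the right column might look like, as a sketch: the check below compares MemAvailable with MemTotal from /proc/meminfo and fails below a threshold. The function name available_ok and the threshold are placeholders; a real monitor would express the same ratio in its own rule language.

```shell
# available_ok MIN_PCT [FILE]: hypothetical check. Exit 0 if MemAvailable is
# at least MIN_PCT percent of MemTotal, 1 if it is lower, 2 if fields are missing.
available_ok() {
    local min_pct="$1" file="${2:-/proc/meminfo}"
    local key value unit total avail pct
    total=''
    avail=''
    while read -r key value unit; do
        case $key in
            MemTotal:)     total=$value ;;
            MemAvailable:) avail=$value ;;
        esac
    done <"$file"
    if [ -z "$total" ] || [ -z "$avail" ]; then
        echo "MemTotal or MemAvailable missing from $file" >&2
        return 2
    fi
    pct=$(( $avail * 100 / $total ))
    echo "available=${pct}%"
    [ "$pct" -ge "$min_pct" ]
}
```

For example, available_ok 10 || echo "memory headroom below 10%" stays silent on the box above, where a free-column alert would have fired.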

Steady si/so: the real thing

Different box, different afternoon. Latency on a service has tripled, the CPU graph looks oddly idle, and load average is high anyway. vmstat 1 tells the story at a glance:

$ vmstat 1 5
 r  b    swpd   free  buff   cache    si    so    bi    bo   in    cs us sy id wa st
 2  1 7340032 142848  4096  301568  1820  2440  2200  2600 3100  5200 14  6 72  8  0  <- since boot: ignore
 1  5 7398220  98304  4096  264192 18432 22528 19100 23300 8900 11200  6  9 24 61  0
 2  4 7455232 101376  4096  248832 16896 20480 17400 21000 8400 10800  7  8 26 59  0
 1  6 7512064  97280  4096  240640 19968 21504 20500 22100 9100 11500  5 10 22 63  0

Every flag from the decoder is up at once: si and so both in the tens of megabytes per second and staying there, b at four to six processes stuck in D state, wa around 60, and the cache column squeezed smaller each second as the kernel scavenges everything reclaimable. The working set no longer fits. Note what the CPU columns say: after the since-boot row, us is in single digits. The machine is not busy, it is blocked, and that is why the latency graph and the CPU graph stopped agreeing. From here the job changes tools: this pair told you that the machine is out of memory, and ranking who is responsible belongs to top & htop (sort by resident set) and the longer walk in what's eating my memory?. The short-term mitigations are the obvious ones (restart or shrink the hog, shed load), because no tuning flag makes a working set fit in RAM that it does not fit in.

The OOM killer struck overnight

Third shape: a service was dead this morning, and by the time anyone looked, memory was fine. free shows plenty available, vmstat 1 is quiet, and the only oddity is vmstat -s reporting a few million pages swapped out on a box that should never swap. That is the post-mortem pattern. Memory pressure built during the night, the kernel swapped until it could not, and then the OOM killer picked the process with the worst score — usually the biggest anonymous memory user, which is usually your service — and killed it to keep the kernel alive. The evidence is not in either of these tools; it is in the kernel log:

$ journalctl -k | grep -i 'out of memory'
Mar 14 03:12:09 app-7 kernel: Out of memory: Killed process 41327 (java) total-vm:28412992kB,
anon-rss:24117248kB, file-rss:1024kB, shmem-rss:0kB, UID:1002 pgtables:49152kB oom_score_adj:0

The line names the victim, and anon-rss (24 GB of anonymous memory, far more than swap could absorb fast enough) names the reason. Reading the surrounding dump, and the rest of the kernel's account of the night, is covered in journalctl & dmesg. The lesson for this page is about timing: free and vmstat 1 only see the present. If the incident is over, the snapshot will be innocent, and the history lives in vmstat -s counters and the logs. Check the logs before declaring a mystery.
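When there are several kills to sift through, the victim fields can be pulled out of the log mechanically. A sketch, assuming the message format shown above on a single line (the exact wording varies between kernel versions, so treat the pattern as an assumption); oom_victims is a hypothetical name.

```shell
# oom_victims: hypothetical helper. Reads kernel log lines on stdin and
# prints "PID NAME ANON_RSS_KB" for each OOM kill it recognises.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p'
}
```

Used as journalctl -k | oom_victims. Lines that do not match the pattern are dropped silently, so an empty result means either no kills or a kernel that words the message differently.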

What's underneath

Everything above falls out of one distinction the kernel makes between two kinds of pages. File-backed pages have a home on disk: program text, libraries, any file read or mapped. The page cache is made of these. Anonymous pages have no file behind them: heap allocations, stacks, the contents of malloc. When memory gets tight, the kernel must reclaim pages, and the two kinds cost very different amounts to evict. A clean file-backed page can simply be dropped — the data still exists in the file, so reclaiming it is free, and re-reading it later is an ordinary disk read. An anonymous page has no copy anywhere, so the only way to evict it is to write it to swap first, and the only way to get it back is to read it from swap later. That asymmetry is why the page cache is the first thing sacrificed under pressure, why buff/cache shrinking is an early warning, and why si/so traffic means pressure has burned through the cheap options and reached the expensive ones.

Why the cache goes first. Dropping a clean file page costs nothing; evicting an anonymous page costs a write now and a read later. si and so measure the expensive loop.

The knob that biases this choice is vm.swappiness, a sysctl from 0 to 200 on current kernels (0 to 100 before Linux 5.8; default 60) that tells the reclaim code how willing it should be to swap anonymous pages rather than drop file pages. Low values keep application memory in RAM at the cost of a smaller cache; high values protect the cache at the cost of swapping the heap. It is a preference, not a capacity tool: no swappiness setting changes how much memory exists, and tuning it on a thrashing machine rearranges deck chairs. Databases sometimes set it low because they run their own caches and would rather the kernel never touch their buffer pool; the relationship between an application's own cache and the kernel's is its own subject, covered from the database side in the page cache.

One level further down, all of this is the virtual memory system doing its job: every process sees a private address space, pages of which may live in RAM, in swap, in a file, or nowhere yet, and a page fault is the mechanism that pulls them in on demand. The full machinery — page tables, faults, the reclaim LRU lists — lives in virtual memory and memory management. And when you need the same anonymous-versus-file breakdown for one specific process rather than the whole machine, /proc/PID/smaps itemises every mapping; the tour is in /proc.

Pitfalls

Reading the first vmstat line. Worth repeating because it produces confident wrong answers in both directions. The first line of vmstat 1 is the average since boot. On a machine that thrashed badly last Tuesday, that line shows swap activity now, and someone "confirms" an incident that ended days ago. On a machine that started thrashing five minutes ago after forty days of calm, the since-boot average dilutes the disaster to a rounding error, and someone rules out the actual problem. Discard line one, read from line two, and when you paste vmstat output into a ticket, paste at least five intervals so the reader can see the trend rather than one possibly unlucky second.

Confusing swap usage with swapping activity. swpd in vmstat and the used cell of free's Swap row are stocks: they measure how much is sitting in swap at this moment, however long ago it got there. si and so are flows: they measure movement during the last interval. A box with 4 GB sitting in swap and si/so at zero is healthy: the kernel parked cold pages there once and nothing has missed them, which is swap working as intended. A box with 200 MB in swap and sustained si/so is in trouble. "We're using swap!" is not a finding. "We're swapping, right now, continuously" is.
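The stock-versus-flow distinction can be made concrete by computing the flow by hand from two snapshots of the kernel counters. A sketch, assuming the pswpin and pswpout counters in /proc/vmstat and a 4 KiB page size (check getconf PAGESIZE; it differs on some architectures); vmstat_counter and swap_flow are illustrative names.

```shell
# vmstat_counter NAME FILE: print one counter from a /proc/vmstat-style file.
vmstat_counter() {
    awk -v want="$1" '$1 == want { print $2; found = 1; exit } END { exit !found }' "$2"
}

# swap_flow BEFORE AFTER SECONDS [PAGE_KIB]: hypothetical helper. Turns two
# snapshots taken SECONDS apart into KiB/s figures comparable to si and so.
swap_flow() {
    local before="$1" after="$2" seconds="$3" page_kib="${4:-4}"
    local in0 in1 out0 out1
    in0=$(vmstat_counter pswpin "$before") || return 1
    in1=$(vmstat_counter pswpin "$after") || return 1
    out0=$(vmstat_counter pswpout "$before") || return 1
    out1=$(vmstat_counter pswpout "$after") || return 1
    echo "si_kib_per_s=$(( ($in1 - $in0) * $page_kib / $seconds )) so_kib_per_s=$(( ($out1 - $out0) * $page_kib / $seconds ))"
}
```

To try it: cat /proc/vmstat > /tmp/a; sleep 5; cat /proc/vmstat > /tmp/b; swap_flow /tmp/a /tmp/b 5. The counters themselves only ever grow; only the difference between two readings says anything about the present.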

Containers seeing the host's memory. Inside a container, free and vmstat read /proc/meminfo and /proc/vmstat, and those files describe the host, not the cgroup the container actually lives in. A container limited to 2 GB on a 256 GB host will cheerfully report a couple of hundred gigabytes available right up until the cgroup controller kills it for touching its 2 GB ceiling. For the truth, read the cgroup files: memory.current and memory.max under /sys/fs/cgroup (cgroup v2), and memory.stat for the cache-versus-anonymous split within the limit. The same trap catches JVMs and other runtimes that size their heaps from "system" memory, and it catches monitoring agents that run inside the container they are supposed to be watching.
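Reading the cgroup files takes only a few lines. A sketch, assuming cgroup v2 and a container whose own cgroup is mounted at /sys/fs/cgroup (the usual arrangement with a cgroup namespace, but not guaranteed; the host's root cgroup does not expose these files); cgroup_memory is a made-up name.

```shell
# cgroup_memory [DIR]: hypothetical helper. Prints current usage and the
# limit for a cgroup v2 directory. memory.max holds the literal string
# "max" when no limit is set.
cgroup_memory() {
    local dir="${1:-/sys/fs/cgroup}" current max
    current=$(cat "$dir/memory.current") || return 1
    max=$(cat "$dir/memory.max") || return 1
    if [ "$max" = "max" ]; then
        echo "current_bytes=$current limit=unlimited"
    else
        echo "current_bytes=$current limit_bytes=$max used_pct=$(( $current * 100 / $max ))"
    fi
}
```

Note that memory.current includes page cache charged to the cgroup, so a high used_pct is not automatically pressure; memory.stat has the split.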

Expecting free to name the culprit. Both of these tools are system-wide by design. They can tell you the machine is short of memory and actively paying for it, and they cannot tell you which process is responsible, because no column in either output has a PID in it. The number of incident channels where someone runs free -h four times in a row hoping for a different shape is larger than it should be. Once the verdict is in, switch tools: per-process ranking is top & htop's job.

A drill you can run right now

Everything below is safe on any Linux machine you are allowed to log into: it reads counters, writes one throwaway file to /tmp, and runs one easily killed process. Ten minutes, and the cache behaviour and the CPU columns stop being trivia.

Step 1 — watch the cache eat a file. Take a snapshot, create a 1 GB file, snapshot again:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       3.1Gi       9.6Gi        96Mi       2.8Gi        12Gi
$ dd if=/dev/zero of=/tmp/ballast bs=1M count=1024
1024+0 records in
1024+0 records out
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       3.1Gi       8.6Gi        96Mi       3.8Gi        12Gi
$ rm /tmp/ballast

Two numbers moved and one did not, and the pattern is the whole lesson. free fell by a gigabyte and buff/cache rose by a gigabyte, because the kernel kept the file's pages in the cache after writing them; if something reads /tmp/ballast next, it will come from RAM. And available barely moved, because cached copies of file data are reclaimable and the kernel knows it. If your mental model were "free = headroom," this experiment just cost you a gigabyte of headroom; in the available model, it cost approximately nothing, which is the correct answer. One caveat: if /tmp is itself a tmpfs mount (df -T /tmp will say so), the file lives in RAM with no disk copy behind it, so shared rises and available falls by the full gigabyte. In that case, write the ballast file to a disk-backed directory instead.

Step 2 — the drop_caches aside. There is a switch that empties the page cache on demand: echo 3 | sudo tee /proc/sys/vm/drop_caches. On a scratch VM it is instructive — run it after step 1 and watch buff/cache collapse and free balloon, proving the cache really was reclaimable. Do not run it on anything production-shaped. It does not free memory in any sense that matters (available already counted it), and it does make every subsequent file read a disk read until the cache rewarms, which on a database or a busy file server shows up as a latency cliff. Its legitimate uses are benchmarking from a cold cache and demonstrations like this one.

Step 3 — give vmstat something to look at. Start a CPU burner, watch the columns react, kill it:

$ yes > /dev/null &
[1] 53210
$ vmstat 1 4
 r  b   swpd   free   buff   cache   si   so   bi   bo   in    cs us sy id wa st
 1  0      0 9011200 88320 3964928   0    0   12   18  160   390  6  2 91  1  0  <- ignore
 2  0      0 9011200 88320 3964928   0    0    0    8 1840  2400 31 68  1  0  0
 2  0      0 9010688 88320 3964928   0    0    0    0 1790  2350 30 69  1  0  0
 2  0      0 9010688 88320 3964928   0    0    0   12 1810  2380 32 67  1  0  0
$ kill %1

yes pins one core, so r sits at 2 (the burner plus whatever else wanted a CPU that second) and one core's worth of idle vanishes. In the sample above that is the whole id column, which is what a single-core VM looks like; on a machine with N cores, expect id to drop by roughly 100/N points instead. Notice the split: sy dwarfs us, because yes spends most of its life inside the write() system call rather than computing anything. Notice also what stayed flat: si, so, b, and wa did not move, because burning CPU is not memory pressure. You have now seen, on a quiet machine, exactly which columns belong to which kind of trouble, which is the skill the incident channel actually needs.

If you remember one line. Capacity is free -h, and the only column that means headroom is available. Suffering is vmstat 1, skip the first line, and the thrash signal is si and so both nonzero, sample after sample.
