free & vmstat
Someone pastes a graph in the incident channel: free memory has been falling for days and
is nearly at zero. Is the box about to die? Almost certainly not — but it might be, and the
difference between "the kernel is caching files, as designed" and "this machine is swapping
itself to death" is the difference between closing the ticket and paging the on-call.
free tells you where the memory went; vmstat tells you whether the
machine is suffering for it. This page covers the handful of invocations worth knowing,
decodes the columns that actually matter, walks three production scenarios, and ends with a
drill you can run on any machine without breaking anything.
The question it answers
There are two memory questions on a Linux box and they get conflated all the time. The first
is an inventory question: where has the memory gone? The second is a behaviour
question: is anything suffering because of it? free answers the first
with a one-shot snapshot of what the kernel is doing with every byte of RAM. vmstat
answers the second by sampling the machine over time and showing you whether memory pressure
is translating into work: pages moving to and from swap, processes blocked on I/O, CPU time
burned waiting. You need both, because each one is misleading without the other.
The conflation matters because Linux deliberately runs with very little free memory. Any RAM that nothing has claimed gets used to cache file contents, on the theory that an unused byte of RAM is a wasted byte of RAM. So on a healthy machine that has been up for a while, free memory trends toward a small number and stays there, and the graph that looks like a slow leak is usually just the page cache filling up. The kernel will hand that cached memory back the moment a process asks for it. A box with 2% free memory can be perfectly fine; a box with 20% free can be minutes from the OOM killer. The free-memory number alone cannot tell you which one you have.
What can tell you is movement. A machine that is genuinely short of memory does not sit
still: it starts evicting things it will need again, writing anonymous pages out to swap and
reading them back in, over and over. That churn is called thrashing, and it has a precise
signature in vmstat's si and so columns. The mental
model for this page is therefore a two-step check. Step one, free -h: is the
available number (not the free number) actually low? Step two, vmstat 1:
is the machine actively swapping right now? Low available plus sustained swap traffic is a
real memory problem. Everything else is probably the cache doing its job. The tools that
come later in an investigation — top & htop
to rank processes, the full hunt in
what's eating my memory? —
start from the verdict these two give you.
The flags that matter
Both tools are small, and the useful surface is smaller still. Here is the working set.
| Invocation | What it shows | When you reach for it |
|---|---|---|
| free -h | One snapshot of memory and swap, in human units | First question on any box: where is the memory? |
| free -w | Same, but buffers and cache as separate columns | When you care which kind of cache is large (rare) |
| free -h -s 2 | The snapshot repeated every 2 seconds | Watching a number move while you provoke it |
| vmstat 1 | One line of system-wide counters per second, forever | The behaviour question: is it thrashing right now? |
| vmstat 1 10 | Same, but stops after 10 samples | Pasting evidence into a ticket |
| vmstat -s | Every counter since boot, one per line, labelled | Totals and history: how much swapping has ever happened |
The interval form is the one to internalise. A bare vmstat with no arguments
prints a single line, and that line is close to useless for the thrashing question, because
most of its columns are averages since boot. A machine that swapped heavily during
one bad hour three weeks ago will show nonzero swap activity in that first line forever
after, and a machine that started thrashing two minutes ago will barely move it. The same
first-line problem applies to vmstat 1 itself: the first row of output is the
since-boot average, and only the second row onward shows what happened during the last
interval. Throw the first line away, every time. It is the single most common misreading of
this tool, and it earns a fuller treatment in the pitfalls below.
vmstat -s is the complement: it is supposed to be since-boot. When you
want to know whether a machine has ever swapped at all, or roughly how much, the
pages swapped in / pages swapped out lines give you the lifetime
totals without any sampling. Both tools read their numbers from procfs: free reads
/proc/meminfo, and vmstat adds /proc/vmstat and /proc/stat. There is no agent and no daemon,
just formatted kernel counters, which is why they cost almost nothing to run and why they
are safe to point at a struggling box. More on the /proc side at
/proc.
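To make the since-boot totals concrete, here is a minimal sketch that reads the same counters vmstat -s formats. It assumes the pswpin and pswpout lines of /proc/vmstat, which count pages, and converts them to MiB using the system page size. The function name swap_totals and its optional arguments are made up for this illustration.

```shell
#!/usr/bin/env bash
# swap_totals [FILE] [PAGE_BYTES]  (hypothetical helper, not part of procps)
# Prints lifetime swap-in and swap-out totals in MiB.
# FILE defaults to /proc/vmstat; PAGE_BYTES defaults to the system page size.
swap_totals() {
    local file="${1:-/proc/vmstat}"
    local page_bytes="${2:-$(getconf PAGESIZE)}"
    awk -v pb="$page_bytes" '
        $1 == "pswpin"  { pin  = $2 }   # pages swapped in since boot
        $1 == "pswpout" { pout = $2 }   # pages swapped out since boot
        END {
            printf "swapped in:  %.1f MiB\n", pin  * pb / 1048576
            printf "swapped out: %.1f MiB\n", pout * pb / 1048576
        }
    ' "$file"
}
```

Calling swap_totals with no arguments reads the live counters. Two zeros mean the box has not swapped since boot; anything else is history, not proof of a current problem.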
Reading free: why "free" is a lie
Here is free -h on a reasonably busy 32 GB application server. The row to
read is Mem; the column to read is not the one your eye goes to first.
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi        12Gi       1.2Gi       340Mi        18Gi        17Gi
Swap:         8.0Gi       1.1Gi       6.9Gi
Read it column by column. total is physical RAM minus what the kernel
reserved for itself at boot. used is memory held by processes and the
kernel that is not cache: heaps, stacks, page tables, slab. free is memory
that nobody is using for anything at all, not even caching. shared is
mostly tmpfs: files living in RAM-backed filesystems like /dev/shm, plus shared
memory segments. buff/cache is the page cache and friends: file contents
the kernel keeps in otherwise-idle RAM so the next read does not touch the disk.
available is the kernel's own estimate of how much memory a new workload
could claim before the system starts swapping.
The trap is the gap between free and available. On this box, free
is 1.2 Gi, which looks alarming, and available is 17 Gi, which is the truth. The
18 Gi of buff/cache is not spoken for; it is opportunistic. Most of it is clean page
cache — copies of file data that also exist on disk — and the kernel can drop those pages
instantly, without writing anything anywhere, the moment a process needs the space. The
free column counts only the RAM the kernel has not found any use for yet, which on a
long-running machine is deliberately almost nothing. Treating that column as headroom is
the classic mistake, and old habits make it worse: for years the folk remedy was the
"-/+ buffers/cache" arithmetic from older versions of free. The
available column made that arithmetic obsolete. It comes straight from
MemAvailable in /proc/meminfo, where the kernel does the honest
version of the calculation for you.
Honest, but still an estimate, and it is worth knowing what is inside it. Available is roughly free memory plus the part of the page cache and slab that is cheap to reclaim, minus a watermark the kernel insists on keeping in reserve. It is not free plus all of buff/cache, because not all of buff/cache can be dropped. Dirty pages must be written back first. Pages belonging to files that are mapped and hot will just get faulted straight back in. tmpfs lives inside the cache accounting but is real data with no disk copy behind it, so it cannot be reclaimed at all (it can only be swapped). That is why available on this box is 17 Gi and not 1.2 + 18 = 19 Gi. The diagram below is the picture worth keeping.
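The gap between the naive sum and the kernel's estimate can be measured directly. The sketch below assumes that free's buff/cache column is Buffers + Cached + SReclaimable from /proc/meminfo, which is how recent procps versions appear to compute it. The function name cache_gap is invented for this example.

```shell
#!/usr/bin/env bash
# cache_gap [FILE]  (hypothetical helper)
# Compares "free + buff/cache" with the kernel's MemAvailable estimate.
# FILE defaults to /proc/meminfo; all values in that file are in kB.
cache_gap() {
    local file="${1:-/proc/meminfo}"
    awk '
        { sub(":", "", $1); kb[$1] = $2 }   # e.g. kb["MemFree"] = 1264640
        END {
            # Assumption: buff/cache = Buffers + Cached + SReclaimable
            cache = kb["Buffers"] + kb["Cached"] + kb["SReclaimable"]
            naive = kb["MemFree"] + cache
            printf "free + buff/cache: %d MiB\n", naive / 1024
            printf "MemAvailable:      %d MiB\n", kb["MemAvailable"] / 1024
            printf "gap:               %d MiB\n", (naive - kb["MemAvailable"]) / 1024
        }
    ' "$file"
}
```

The gap line is the memory that looks like cache but is not cheap to take back: the reserve watermark, tmpfs contents, and the part of the cache the kernel expects to be in active use.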
The Swap row reads the same way as Mem but answers a different question, and the
distinction is one of this page's main points: swap used is a stock, not a flow.
This box has 1.1 Gi sitting in swap, and by itself that means nothing about the present.
Those pages may have been pushed out during a deploy last month and never touched since,
which is swap doing exactly what it is for: holding cold anonymous pages so warm ones can
have the RAM. Whether swap is being used right now is a question free
cannot answer. For that you need the flow, and the flow lives in vmstat.
Reading vmstat: the columns that matter
vmstat 1 prints a header and then one line per second until you stop it. Here
is the same 32 GB box on a normal afternoon. Remember: line one is the since-boot
average, so the real data starts on line two.
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
 r  b    swpd    free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 2  0 1153434 1264640  91432 18874368    1    1    24    31  210  480  9  3 87  1  0   <- since boot: ignore
 1  0 1153434 1259520  91432 18874368    0    0     0   116 4210 7110 12  4 83  1  0
 3  0 1153434 1248212  91432 18874604    0    0    12   204 4677 8021 14  5 80  1  0
 1  0 1153434 1251020  91432 18874604    0    0     0    88 3980 6850 11  4 84  1  0
Seventeen columns, but only a handful carry the diagnosis. Here is the decoder, grouped the way the header groups them.
| Column | What it counts | How to read it |
|---|---|---|
| r | Processes runnable: running or waiting for a CPU | Persistently above the core count means CPU contention |
| b | Processes in uninterruptible sleep (D state), almost always blocked on I/O | Persistently nonzero means something is stuck waiting on disk or NFS |
| swpd | Total swap in use (a stock) | History, not pressure; see the pitfalls |
| si / so | KiB/s swapped in from disk / out to disk (a flow) | Both sustained nonzero is the thrash signal |
| bi / bo | Blocks/s read from / written to block devices | General disk traffic; spikes are normal, walls are not |
| us / sy | CPU % in user code / in the kernel | High sy with high si/so is the kernel fighting for memory |
| id / wa | CPU % idle / idle-but-waiting-for-I/O | High wa means the CPU has work it cannot start until the disk answers |
The r and b pair deserves a sentence more, because they look alike
and mean opposite things. r is appetite for CPU: those processes have
everything they need except a core to run on. b is the D-state count, processes
the kernel will not even let a signal interrupt because they are mid-flight inside an I/O
operation. A loaded-but-healthy machine has r bouncing around the core count
and b at zero. A thrashing machine often inverts that: r modest,
b climbing, because everyone is queued behind the disk that swap lives on.
(These are also the two numbers that feed load average, which is why a thrashing box shows
a scary load with a mostly idle CPU.)
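vmstat only counts the blocked processes; ps can name them. A small sketch, assuming procps ps and its state, pid and comm output fields. The helper name count_states is made up here, and command names containing spaces will be cut at the first word.

```shell
#!/usr/bin/env bash
# count_states  (hypothetical helper)
# Reads "STATE PID COMMAND" lines on stdin, as printed by
#   ps -eo state=,pid=,comm=
# and reports how many processes are runnable (R) or in uninterruptible
# sleep (D), listing each D-state process by pid and command name.
count_states() {
    awk '
        $1 == "R" { r++ }
        $1 == "D" { d++; blocked = blocked sprintf("  pid %s (%s)\n", $2, $3) }
        END {
            printf "runnable (R): %d\nuninterruptible (D): %d\n", r, d
            printf "%s", blocked
        }
    '
}
```

Run it as ps -eo state=,pid=,comm= | count_states. The ps process itself usually shows up as one of the runnable entries, so the R count will not match vmstat's r column exactly.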
si and so are the heart of the tool. so alone, in
bursts, is unremarkable: the kernel proactively pushing cold pages out to make room is
housekeeping. si alone, briefly, is also fine: something touched a page that
was swapped out long ago. The pathological pattern is both columns nonzero, sample
after sample. That means pages are being evicted and then needed again within seconds,
the working set genuinely does not fit in RAM, and every major page fault costs a disk
round-trip. You will see wa rise and id fall in the same rows,
because the CPU spends its time waiting for swap I/O instead of doing work. One screen of
vmstat 1 showing that pattern is sufficient evidence of real memory pressure;
no further philosophy required.
bi and bo are the context line. They count all block I/O,
not just swap, so a backup job or a database checkpoint will light up bo with
no memory story at all. Their value here is correlation: when si/so
and bi/bo surge together while cache shrinks, you are
watching the eviction machinery work in real time.
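The reading rule for si and so is mechanical enough to script. The sketch below is one possible filter, not a standard tool: it finds the si and so columns by header name, throws away the since-boot row, and counts the samples where both are nonzero. The name thrash_check and the verdict wording are invented for illustration.

```shell
#!/usr/bin/env bash
# thrash_check  (hypothetical helper)
# Reads `vmstat <interval> <count>` output on stdin.
thrash_check() {
    awk '
        $1 == "procs" { next }                 # banner line above the column names
        $1 == "r" && $2 == "b" {               # column-name header
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        !("si" in col) { next }                # nothing usable before the header
        !skipped { skipped = 1; next }         # first data row is the since-boot average
        {
            n++
            if ($(col["si"]) > 0 && $(col["so"]) > 0) hot++
        }
        END {
            if (n == 0) { print "no interval samples"; exit }
            printf "%d of %d samples show swap-in and swap-out together\n", hot, n
            if (hot == n) {
                print "verdict: sustained swapping in both directions"
            } else if (hot > 0) {
                print "verdict: intermittent, sample for longer"
            } else {
                print "verdict: no two-way swap traffic"
            }
        }
    '
}
```

A typical call is vmstat 1 10 | thrash_check. Treat the verdict as a prompt to look at the raw rows, since ten seconds is a short window and the threshold of "anything above zero" is deliberately crude.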
Three production scenarios
"The box has no free memory!"
A monitoring alert fires on free memory below 5%, or a teammate runs free,
sees a tiny number in the free column, and declares an emergency. The two-step check
settles it in thirty seconds:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       9.8Gi       412Mi       180Mi        21Gi        20Gi
Swap:         8.0Gi          0B       8.0Gi
$ vmstat 1 5
 r  b    swpd    free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 1  0       0  421888 102400 21893120    0    0    18    22  180  420  7  2 90  1  0
 0  0       0  419840 102400 21893120    0    0     0    64 2100 3900  5  2 92  1  0
 1  0       0  422912 102400 21893188    0    0     8    40 2240 4100  6  2 91  1  0
Available is 20 Gi, swap is untouched, si and so are flat
zero, wa is 1. The 21 Gi in buff/cache is file data the kernel is keeping
warm — this machine probably serves a lot of reads from disk, and the cache is why those
reads are fast. There is no incident here; there is an alert that was written against the
wrong column. The durable fix is to re-point the monitor at MemAvailable
(or the available column) instead of free, because an alert on free memory on a Linux box
is an alert on whether the machine has been up long enough for the cache to fill.
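What a check against the right column might look like, as a sketch: it reads MemTotal and MemAvailable from /proc/meminfo and compares the percentage with a threshold. The function names and the exit-code convention are assumptions for this example, and the threshold is whatever your own baseline suggests.

```shell
#!/usr/bin/env bash
# available_pct [FILE]  (hypothetical helper)
# Prints MemAvailable as a whole-number percentage of MemTotal.
available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total == 0) exit 1          # file had no MemTotal line
            printf "%d\n", avail * 100 / total
        }
    ' "${1:-/proc/meminfo}"
}

# check_available THRESHOLD [FILE]  (hypothetical helper)
# Returns 0 while available memory is at or above THRESHOLD percent,
# 1 when it is below, 2 when the input could not be read.
check_available() {
    local threshold="$1" pct
    pct=$(available_pct "${2:-/proc/meminfo}") || return 2
    if [ "$pct" -lt "$threshold" ]; then
        echo "WARN: only ${pct}% of memory available (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: ${pct}% of memory available"
}
```

For example, check_available 10 warns only when less than a tenth of RAM is claimable. On the box above it would report 64% and stay quiet, which is the right answer.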
Steady si/so: the real thing
Different box, different afternoon. Latency on a service has tripled, the CPU graph looks
oddly idle, and load average is high anyway. vmstat 1 tells the story at a
glance:
$ vmstat 1 5
 r  b    swpd    free   buff    cache    si    so    bi    bo   in    cs us sy id wa st
 2  1 7340032  142848   4096   301568  1820  2440  2200  2600 3100  5200 14  6 72  8  0
 1  5 7398220   98304   4096   264192 18432 22528 19100 23300 8900 11200  6  9 24 61  0
 2  4 7455232  101376   4096   248832 16896 20480 17400 21000 8400 10800  7  8 26 59  0
 1  6 7512064   97280   4096   240640 19968 21504 20500 22100 9100 11500  5 10 22 63  0
Every flag from the decoder is up at once: si and so both in the
tens of megabytes per second and staying there, b at four to six processes
stuck in D state, wa around 60, the cache column being squeezed smaller each
second as the kernel scavenges everything reclaimable. The working set no longer fits.
Note what the CPU columns say: us is single digits. The machine is not busy,
it is blocked, and that is why the latency graph and the CPU graph stopped agreeing.
From here the job changes tools: this pair told you that the machine is out of
memory, and ranking who is responsible belongs to
top & htop (sort by resident set) and
the longer walk in
what's eating my memory?.
The short-term mitigations are the obvious ones — restart or shrink the hog, shed load —
because no tuning flag makes a working set fit in RAM that it does not fit in.
The OOM killer struck overnight
Third shape: a service was dead this morning, and by the time anyone looked, memory was
fine. free shows plenty available, vmstat 1 is quiet, and the only
oddity is vmstat -s reporting a few million pages swapped out on
a box that should never swap. That is the post-mortem pattern. Memory pressure built
during the night, the kernel swapped until it could not, and then the OOM killer picked the
process with the worst score — usually the biggest anonymous memory user, which is usually
your service — and killed it to keep the kernel alive. The evidence is not in either of
these tools; it is in the kernel log:
$ journalctl -k | grep -i 'out of memory'
Mar 14 03:12:09 app-7 kernel: Out of memory: Killed process 41327 (java) total-vm:28412992kB, anon-rss:24117248kB, file-rss:1024kB, shmem-rss:0kB, UID:1002 pgtables:49152kB oom_score_adj:0
The line names the victim, and anon-rss (about 23 GiB of anonymous memory, far more than swap
could absorb fast enough) names the reason. Reading the surrounding dump, and the rest of the
kernel's account of the night, is covered in
journalctl & dmesg. The lesson
for this page is about timing: free and vmstat 1 only see the
present. If the incident is over, the snapshot will be innocent, and the history lives in
vmstat -s counters and the logs. Check the logs before declaring a mystery.
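When the log holds more than one kill, a short filter keeps the three facts that matter. This is a sketch that assumes the "Out of memory: Killed process PID (name) ... anon-rss:NkB" layout shown above; other kernel versions word the line differently, and the helper name oom_victims is made up.

```shell
#!/usr/bin/env bash
# oom_victims  (hypothetical helper)
# Reads kernel log lines on stdin and prints one summary per OOM kill:
# the pid, the command name, and the anonymous RSS converted from kB to GiB.
# Lines that do not match the assumed format are silently skipped.
oom_victims() {
    sed -n -E 's/.*Out of memory: Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p' |
    awk '{ printf "pid=%s comm=%s anon_rss=%.1f GiB\n", $1, $2, $3 / 1048576 }'
}
```

Typical use is journalctl -k | oom_victims. For the line above it prints pid=41327 comm=java anon_rss=23.0 GiB.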
What's underneath
Everything above falls out of one distinction the kernel makes between two kinds of pages.
File-backed pages have a home on disk: program text, libraries, any file read or
mapped. The page cache is made of these. Anonymous pages have no file behind them:
heap allocations, stacks, the contents of malloc. When memory gets tight, the
kernel must reclaim pages, and the two kinds cost very different amounts to evict. A clean
file-backed page can simply be dropped — the data still exists in the file, so reclaiming
it is free, and re-reading it later is an ordinary disk read. An anonymous page has no copy
anywhere, so the only way to evict it is to write it to swap first, and the only way to get
it back is to read it from swap later. That asymmetry is why the page cache is the first
thing sacrificed under pressure, why buff/cache shrinking is an early warning, and why
si/so traffic means pressure has burned through the cheap options
and reached the expensive ones.
The knob that biases this choice is vm.swappiness, a sysctl from 0 to 200 on kernel 5.8 and later (0 to 100 on older kernels),
default 60, that tells the reclaim code how willing it should be to swap anonymous pages rather
than drop file pages. Low values keep application memory in RAM at the cost of a smaller
cache; high values protect the cache at the cost of swapping the heap. It is a preference,
not a capacity tool: no swappiness setting changes how much memory exists, and tuning it on
a thrashing machine rearranges deck chairs. Databases sometimes set it low because they run
their own caches and would rather the kernel never touch their buffer pool — the relationship
between an application's own cache and the kernel's is its own subject, covered from the
database side in the page cache.
One level further down, all of this is the virtual memory system doing its job: every
process sees a private address space, pages of which may live in RAM, in swap, in a file,
or nowhere yet, and a page fault is the mechanism that pulls them in on demand. The
full machinery — page tables, faults, the reclaim LRU lists — lives in
virtual memory
and memory
management. And when you need the same anonymous-versus-file breakdown for one specific
process rather than the whole machine, /proc/PID/smaps itemises every mapping;
the tour is in /proc.
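For a single process, the anonymous-versus-everything-else split can be totalled from smaps with a few lines of awk. A minimal sketch, assuming the per-mapping Rss: and Anonymous: fields; the remainder it prints is file-backed and shared memory lumped together, and the function name smaps_split is invented here.

```shell
#!/usr/bin/env bash
# smaps_split  (hypothetical helper)
# Reads the contents of /proc/PID/smaps on stdin and totals, in kB,
# the resident memory and the anonymous share of it across all mappings.
smaps_split() {
    awk '
        $1 == "Rss:"       { rss  += $2 }   # resident pages of this mapping
        $1 == "Anonymous:" { anon += $2 }   # anonymous part of this mapping
        END {
            printf "resident:   %d kB\n", rss
            printf "anonymous:  %d kB\n", anon
            printf "file/shmem: %d kB\n", rss - anon
        }
    '
}
```

Usage is sudo cat /proc/PID/smaps | smaps_split, with PID replaced by the process you are investigating. Reading another user's smaps usually needs root.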
Pitfalls
Reading the first vmstat line. Worth repeating because it produces
confident wrong answers in both directions. The first line of vmstat 1 is the
average since boot. On a machine that thrashed badly last Tuesday, that line shows swap
activity now, and someone "confirms" an incident that ended days ago. On a machine that
started thrashing five minutes ago after forty days of calm, the since-boot average dilutes
the disaster to a rounding error, and someone rules out the actual problem. Discard line
one, read from line two, and when you paste vmstat output into a ticket, paste at least
five intervals so the reader can see the trend rather than one possibly unlucky second.
Confusing swap usage with swapping activity. swpd in vmstat
and the used cell of free's Swap row are stocks: they measure how much has accumulated in
swap over the machine's whole history. si and so are flows: they
measure movement during the last interval. A box with 4 GB sitting in swap and
si/so at zero is healthy — the kernel parked cold pages there once
and nothing has missed them, which is swap working as intended. A box with 200 MB in
swap and sustained si/so is in trouble. "We're using swap!" is
not a finding. "We're swapping, right now, continuously" is.
Containers seeing the host's memory. Inside a container,
free and vmstat read /proc/meminfo and
/proc/vmstat, and those files describe the host, not the cgroup the
container actually lives in. A container limited to 2 GB on a 256 GB host will
cheerfully report a couple of hundred gigabytes available right up until the cgroup
controller kills it for touching its 2 GB ceiling. For the truth, read the cgroup
files: memory.current and memory.max under
/sys/fs/cgroup (cgroup v2), and memory.stat for the
cache-versus-anonymous split within the limit. The same trap catches JVMs and other
runtimes that size their heaps from "system" memory, and it catches monitoring agents that
run inside the container they are supposed to be watching.
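Reading the cgroup files takes only a few lines. The sketch below assumes cgroup v2 and a directory that contains memory.current and memory.max, which is typically what /sys/fs/cgroup looks like from inside a container; on the host you would pass the path of the specific cgroup instead. The function name cgroup_mem is made up for this example.

```shell
#!/usr/bin/env bash
# cgroup_mem [DIR]  (hypothetical helper)
# Reports memory usage against the limit for one cgroup v2 directory.
# DIR defaults to /sys/fs/cgroup. Both files hold byte counts, except that
# memory.max contains the literal word "max" when no limit is set.
cgroup_mem() {
    local dir="${1:-/sys/fs/cgroup}" current max
    current=$(cat "$dir/memory.current") || return 1
    max=$(cat "$dir/memory.max") || return 1
    if [ "$max" = "max" ]; then
        echo "used $((current / 1048576)) MiB, no limit set"
    else
        echo "used $((current / 1048576)) MiB of $((max / 1048576)) MiB ($((current * 100 / max))%)"
    fi
}
```

This is the number to compare against the ceiling, not anything free prints inside the container.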
Expecting free to name the culprit. Both of these tools are system-wide
by design. They can tell you the machine is short of memory and actively paying for it,
and they cannot tell you which process is responsible, because no column in either output
has a PID in it. The number of incident channels where someone runs free -h
four times in a row hoping for a different shape is larger than it should be. Once the
verdict is in, switch tools: per-process ranking is
top & htop's job.
A drill you can run right now
Everything below is safe on any Linux machine you are allowed to log into: it reads
counters, writes one throwaway file to /tmp, and runs one easily killed
process. Ten minutes, and the cache behaviour and the CPU columns stop being trivia.
Step 1 — watch the cache eat a file. Take a snapshot, create a 1 GB file, snapshot again:
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.1Gi       9.6Gi        96Mi       2.8Gi        12Gi
$ dd if=/dev/zero of=/tmp/ballast bs=1M count=1024
1024+0 records in
1024+0 records out
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       3.1Gi       8.6Gi        96Mi       3.8Gi        12Gi
$ rm /tmp/ballast
Two numbers moved and the one that matters did not, and the pattern is the whole lesson. free fell by a
gigabyte and buff/cache rose by a gigabyte, because the kernel kept the file's pages in the
cache after writing them: if something reads /tmp/ballast before it is removed, the data will come
from RAM. And available barely moved, because cached copies of file data are reclaimable
and the kernel knows it. If your mental model were "free = headroom," this experiment just
cost you a gigabyte of headroom; in the available model, it cost approximately nothing,
which is the correct answer. One caveat: if /tmp is a tmpfs mount on your machine, the file
has no disk copy behind it, so shared rises and available falls by the same gigabyte. Point
dd at a disk-backed directory instead.
Step 2 — the drop_caches aside. There is a switch that empties the page
cache on demand: echo 3 | sudo tee /proc/sys/vm/drop_caches. On a scratch VM
it is instructive — run it after step 1 and watch buff/cache collapse and free balloon,
proving the cache really was reclaimable. Do not run it on anything production-shaped. It
does not free memory in any sense that matters (available already counted it), and it does
make every subsequent file read a disk read until the cache rewarms, which on a database or
a busy file server shows up as a latency cliff. Its legitimate uses are benchmarking from a
cold cache and demonstrations like this one.
Step 3 — give vmstat something to look at. Start a CPU burner, watch the columns react, kill it:
$ yes > /dev/null &
[1] 53210
$ vmstat 1 4
 r  b    swpd    free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 1  0       0 9011200  88320  3964928    0    0    12    18  160  390  6  2 91  1  0   <- ignore
 2  0       0 9011200  88320  3964928    0    0     0     8 1840 2400 31 68  1  0  0
 2  0       0 9010688  88320  3964928    0    0     0     0 1790 2350 30 69  1  0  0
 2  0       0 9010688  88320  3964928    0    0     0    12 1810 2380 32 67  1  0  0
$ kill %1
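yes pins one core, so r sits at 2 (the burner plus whatever else
wanted a CPU that second) and one core's worth of idle vanishes. In this sample id falls
all the way to 1, which is what a single-CPU VM looks like; on an N-core machine expect
id to drop by roughly 100/N points instead. Notice the split: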
sy dwarfs us, because yes spends most of its life
inside the write() system call rather than computing anything. Notice also
what stayed flat: si, so, b, and wa did
not move, because burning CPU is not memory pressure. You have now seen, on a quiet
machine, exactly which columns belong to which kind of trouble — which is the skill the
incident channel actually needs.
Inventory is free -h, and the only
column that means headroom is available. Suffering is vmstat 1, skip
the first line, and the thrash signal is si and so both nonzero,
sample after sample.
Further reading
- free(1) and vmstat(8) — both short; the FIELD DESCRIPTIONS section of vmstat(8) is the authoritative column list.
- proc_meminfo(5) — what every line of /proc/meminfo means, including the exact definition of MemAvailable.
- linuxatemyram.com — the canonical one-page rebuttal to "Linux ate my RAM," good for sending to the incident channel.
- Kernel docs — sysctl/vm — swappiness, drop_caches, and the rest of the vm knobs, with the maintainers' own warnings attached.
- Semicolony — Memory management — the reclaim machinery this page keeps gesturing at, in full.