What's eating my memory?
The dashboard says 94% used. Someone has already typed "memory leak" in the incident channel. Maybe. Or maybe the kernel is doing exactly what it should and the only thing leaking is confidence in the graph. Memory complaints are the most over-reported and under-diagnosed symptom in Linux operations, because the obvious numbers — "used", RSS, the percent gauge — all lie in well-documented ways. This page walks the investigation step by step: each stop is one command, the output you will actually see, and the decision that output forces. By the end you have one of four named causes instead of a vibe.
Step 1 — is it actually full?
Every memory investigation starts with the same triage question, and most of them end
there too: is the machine under memory pressure, or is the page cache doing its job?
Linux treats free RAM as wasted RAM. Any page not needed by a process gets used to
cache file contents, because a cached read is thousands of times cheaper than a disk
read, and the kernel can hand those pages back the moment a process asks for them.
A healthy server runs near 100% "used" by design. So the first command is
free -h, and the discipline is to ignore the column everyone stares at:
    $ free -h
                   total        used        free      shared  buff/cache   available
    Mem:            15Gi       9.8Gi       442Mi       210Mi       5.4Gi       5.1Gi
    Swap:          2.0Gi       128Mi       1.9Gi
Read it right to left. available is the kernel's own estimate of how
much memory could be handed to a new allocation without swapping: free pages plus the
part of the cache and other reclaimable memory it can drop on demand. That is the
number that decides whether this investigation continues. buff/cache
is the page cache plus kernel buffers — the 5.4 GiB here is not "eaten", it is on
loan. free at 442 MiB looks alarming and means almost nothing;
on a long-running box, free is always small because the kernel put every idle page to
work. The dashboard that paged you is probably plotting something like
(total - free) / total, which counts the loan as spent.
The decision this forces: if available is comfortable — say a third of total, as here —
there is no memory problem, and you should be asking why the alert fired, not what is
eating RAM. If available is genuinely small (a few percent of total) or swap usage is
climbing, the pressure is real and you move to step 2. The full anatomy of this
output, including what vmstat adds when you want to watch pressure move
rather than glance at it, is on
free & vmstat. One quick
corroborating signal worth knowing: vmstat 1 with sustained non-zero
si/so columns means the kernel is actively swapping, and that
is pressure no matter what any gauge says.
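To make the gap between the gauge and the kernel's estimate concrete, here is a minimal sketch that prints both from the same file free reads. The function name mem_pressure is made up for this page, and "not-free" is only an assumption about what a naive dashboard plots; MemTotal, MemFree, and MemAvailable are real /proc/meminfo fields.

```shell
# mem_pressure [FILE]: print the share of RAM that is simply "not free"
# (an assumed stand-in for a naive dashboard gauge) next to the share the
# kernel reports as available. FILE defaults to /proc/meminfo.
mem_pressure() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemFree:"      { free  = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total == 0) { print "no MemTotal line found"; exit 1 }
            printf "not-free %.1f%%  available %.1f%%\n",
                   100 * (total - free) / total, 100 * avail / total
        }
    ' "${1:-/proc/meminfo}"
}
```

Run mem_pressure with no argument on the host, or pass a saved copy of /proc/meminfo to replay the numbers from an incident. A high first number next to a comfortable second one is the false alarm described above.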
Step 2 — who is holding it?
Real pressure confirmed, so now the question becomes attribution. The blunt instrument is good enough to start: rank every process by resident set size, the amount of physical RAM each one currently has mapped.
    $ ps aux --sort=-%mem | head -6
    USER       PID %CPU %MEM      VSZ     RSS TTY STAT START   TIME COMMAND
    app      31415 12.4 41.2 19075184 6571236 ?   Ssl  Jun02 412:11 java -Xmx5g -jar api.jar
    postgres  2204  1.1  9.0  2167040 1438112 ?   Ss   May14  88:02 postgres: checkpointer
    app      31488  0.8  3.1  1204416  498332 ?   Sl   Jun02  21:40 node worker.js
    app      31489  0.7  3.1  1199288  496104 ?   Sl   Jun02  20:55 node worker.js
    app      31490  0.7  3.0  1198472  494817 ?   Sl   Jun02  21:02 node worker.js
The same view, live: run top, press M (capital), and the
table re-sorts by memory and stays sorted while you watch. Two columns matter and one
is a trap. RSS is resident physical memory in kilobytes — real pages
in real RAM right now. VSZ is virtual address space, and it is the
trap: it counts every mapping the process has ever asked for, including reserved but
untouched regions, mapped files, and arenas the allocator grabbed speculatively. A JVM
with a 19 GB VSZ on a 15 GB machine is normal; a 19 GB RSS would be
impossible. Sort by RSS, reason about RSS, and treat VSZ as trivia unless you are
debugging address-space exhaustion specifically.
But RSS has its own lie, and on this box it is sitting in rows three through five.
Those node workers each report ~490 MB resident, so the table
suggests they cost 1.5 GB together. They do not. RSS counts every page resident
in the process's address space, including pages shared with other processes:
the node binary, libc, every shared library, and any memory the workers inherited from
a common parent and have not written to since the fork. Each shared page gets counted
once per process that maps it. Sum the RSS column on a machine running a prefork
server with thirty workers and you can "account for" more RAM than the machine has.
The decision this step forces: if one private-heap process dominates (the JVM here, at
6.6 GB against its 5 GB heap flag, which is interesting), RSS is honest enough and
you can move on. If the suspects are a fleet of sibling workers, you need the
per-process truth, which is the next step.
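When the table is long, it helps to see at a glance whether the weight sits in one process or in a fleet. The sketch below groups ps output by command name; rss_by_command is a hypothetical helper, and the totals it prints are summed RSS, so they carry exactly the shared-page double counting described above. Command names containing spaces are cut at the first word.

```shell
# rss_by_command: read "RSS_kB COMMAND" lines on stdin and print, per
# command name, the process count and the summed RSS in MiB, largest first.
# Summed RSS overcounts shared pages; treat it as a ranking, not a bill.
rss_by_command() {
    awk '{ count[$2]++; kb[$2] += $1 }
         END { for (c in count) printf "%s %d %d\n", c, count[c], kb[c] / 1024 }' |
    sort -k3,3nr
}

# Usage: ps -eo rss=,comm= | rss_by_command | head
```

One line with a count of 1 at the top means RSS is honest enough; a line with a large count at the top means the fleet case, and the numbers need the per-process treatment.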
Step 3 — RSS, PSS, USS: the per-process truth
The kernel keeps the honest ledger in /proc/PID/smaps, which lists every
memory mapping the process has with a dozen counters each. Reading it raw is for
masochists; the kernel pre-sums it for you in smaps_rollup:
    $ cat /proc/31488/smaps_rollup
    55d8e2a00000-7ffec1b9d000 ---p 00000000 00:00 0                          [rollup]
    Rss:              498332 kB
    Pss:              361409 kB
    Shared_Clean:     189204 kB
    Shared_Dirty:      12080 kB
    Private_Clean:      8112 kB
    Private_Dirty:    288936 kB
    Referenced:       441208 kB
    Anonymous:        297048 kB
    Swap:                  0 kB
    SwapPss:               0 kB
Three numbers, three different questions answered. RSS (498 MB)
is what ps showed: every resident page, shared ones counted in full.
PSS, proportional set size (361 MB), splits each shared page's
cost evenly among the processes mapping it — a page shared by four processes charges
each one a quarter. PSS columns are additive: sum PSS across all processes and you get
a true total that cannot exceed physical RAM, which makes it the right number for
"how much do these thirty workers cost together?" USS, unique set
size, is Private_Clean + Private_Dirty (297 MB here): pages this
process alone holds, the amount of RAM that would be freed if you killed it. USS is
the right number for "what do I get back if this process goes away?" and it is also
the most honest leak metric, because leaked allocations are private by nature.
On this box the arithmetic settles the worker question: three workers at ~490 MB
RSS look like 1.5 GB, but at ~300 MB USS each plus one shared copy of the
runtime, the real bill is closer to 1 GB. Not the problem. The JVM with
6.6 GB of almost entirely private, dirty memory still is. The general rule: RSS
to rank suspects fast, PSS to budget a fleet, USS to size the refund. The
/proc filesystem these files live in gets a full tour at
/proc, and the machinery behind these
counters — pages, mappings, copy-on-write after fork — is the subject of
virtual memory.
The decision this step forces: you now have one named suspect with a real, private footprint. The next question is not "how big is it" but "which way is it moving."
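The three numbers can be pulled out of smaps_rollup in one pass. This is a small sketch, not a replacement for a real tool: set_sizes is a name invented here, and USS is computed as Private_Clean + Private_Dirty, the definition used above.

```shell
# set_sizes FILE: print RSS, PSS and USS in kB from an smaps_rollup file.
# USS is not a kernel field; it is derived as Private_Clean + Private_Dirty.
set_sizes() {
    awk '
        $1 == "Rss:"           { rss = $2 }
        $1 == "Pss:"           { pss = $2 }
        $1 == "Private_Clean:" { uss += $2 }
        $1 == "Private_Dirty:" { uss += $2 }
        END { printf "rss=%d pss=%d uss=%d\n", rss, pss, uss }
    ' "$1"
}

# Usage: set_sizes /proc/31488/smaps_rollup
```

Fed the rollup shown above, it reports uss=297048, the 297 MB figure. The exact match on "Pss:" matters, because newer kernels add fields such as Pss_Anon that a looser pattern would also pick up.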
Step 4 — leak or plateau? Watch the shape
A single RSS reading cannot distinguish a leak from a process that is merely large. Leaks are a verb, not a noun: memory that grows without bound under steady load. So you sample over time. No tooling required beyond a loop:
    $ while sleep 60; do echo "$(date +%H:%M) $(ps -o rss= -p 31415)"; done
    14:02 6571236
    14:03 6588412
    14:04 6601780
    14:05 6615924
    14:06 6633108
    …ten minutes pass…
    14:16 6772040
Then read the shape, not the values. Processes with managed heaps (JVM, Go, Node, Python) breathe: RSS climbs as garbage accumulates, drops when a collection runs, climbs again. Plotted, that is a sawtooth, and a sawtooth oscillating around a flat plateau is a healthy process, however large the plateau is. A leak looks different: a staircase. Each step up is permanent; collections still run, but the floor after each one is higher than the floor before, because the leaked objects are still referenced (or were allocated outside the managed heap entirely) and nothing can reclaim them. The staircase climbs until it meets a ceiling, and the ceiling decides how the story ends — swap death on a bare host, an OOM kill under a cgroup limit.
Two cautions before you call it. First, sample under comparable load: RSS climbing during a traffic ramp is not a leak, it is a process doing its job, and many runtimes never return freed heap to the OS even after load drops, so a high-water mark that holds steady is a plateau too. Second, give it time. A sawtooth with a long GC period can look like a staircase over ten minutes; over an hour it shows its teeth. The decision: a plateau larger than you can afford is a sizing or tuning conversation. A staircase is a leak, and your job shifts from diagnosis to evidence collection — covered in the endings below.
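Once the shape says staircase, a number helps the ticket. The sketch below turns a column of samples into a mean growth rate per sampling interval; growth_rate is a hypothetical helper, and it assumes the samples were taken at a fixed interval. It only quantifies a slope. It cannot tell a staircase from the rising edge of a sawtooth, so read the shape first.

```shell
# growth_rate: read one RSS sample in kB per line on stdin (fixed sampling
# interval assumed) and print the first sample, the last sample, and the
# mean growth per interval.
growth_rate() {
    awk 'NF { if (!n) first = $1; last = $1; n++ }
         END {
             if (n < 2) { print "need at least two samples"; exit 1 }
             printf "samples=%d first=%d last=%d kB_per_interval=%d\n",
                    n, first, last, (last - first) / (n - 1)
         }'
}

# Usage: save the RSS column from the watch loop to a file, then
#        growth_rate < rss_samples.txt
```

With the first five samples from the loop above, the mean works out to 15468 kB per minute.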
Step 5 — has the OOM killer already been here?
Sometimes you arrive after the verdict. If memory ran out before you got there, the kernel chose a process and killed it, and it wrote down exactly what it did and why. Always check the kernel log; a service that "mysteriously restarted" last night is very often an OOM kill nobody read.
    $ journalctl -k --since yesterday | grep -iB1 -A4 'out of memory'
    Jun 07 03:12:44 api-3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
        cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/api.service,
        task=java,pid=29217,uid=1001
    Jun 07 03:12:44 api-3 kernel: Out of memory: Killed process 29217 (java)
        total-vm:19075184kB, anon-rss:13727336kB, file-rss:2048kB, shmem-rss:0kB,
        UID:1001 pgtables:28412kB oom_score_adj:0
(On a box without journald, dmesg -T | grep -i oom reads the same ring
buffer.) Decode the kill line and you get a small autopsy. Killed process
29217 (java) names the victim. anon-rss:13727336kB is the part
that matters: 13.7 GB of anonymous memory — heap and stacks, private to the
process, not backed by any file — which on a 15 GB machine is the whole story.
total-vm is virtual size, big and boring as usual.
file-rss near zero says this was not about mapped files. Above the kill
line the kernel also dumps a table of every candidate process with its rss and
oom_score_adj, the inputs to each process's badness score (readable live in
/proc/PID/oom_score); the victim is normally the top scorer, and the score is
essentially "fraction of available memory this process is using," nudged by
oom_score_adj.
That adjustment knob runs from -1000 to +1000 and is worth knowing from both sides. A value of -1000 makes a process untouchable (sshd often ships with protection like this, so the box stays reachable after a kill); positive values volunteer a process to die first. Two operational notes. If your critical service keeps being the victim, resist the urge to set -1000 and walk away: the memory is still gone, and the killer will simply take the second-place process, which is usually something you wanted alive too. And note which process was killed versus which was guilty: the killer targets the biggest eligible consumer at the moment of crisis, which is usually but not always the process whose growth caused the crisis. A slow leak in a small daemon can push the kernel to execute your perfectly innocent database. The full guide to reading the kernel's logs is at journalctl & dmesg.
The decision this forces: a kill record turns the investigation from "is there a problem" into "name the grower." Take the victim's identity with suspicion, find what was growing before the kill (your metrics history, or the RSS watch from step 4 on the restarted process), and check the next step, because the constraint line in that log tells you whether the wall was the host or a cgroup.
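For a log with several kills, the three facts that matter (constraint, victim, anon-rss) can be extracted mechanically. This is a rough sketch that assumes the two line formats shown above; oom_summary is a name invented here, and kernels that word the kill line differently will simply produce no output for that line.

```shell
# oom_summary: read kernel log lines on stdin. For each "oom-kill:" line
# print the constraint; for each "Killed process" line print the victim's
# pid, command name and anon-rss in kB. Other lines are ignored.
oom_summary() {
    sed -n \
        -e 's/.*oom-kill:constraint=\([A-Z_]*\).*/constraint=\1/p' \
        -e 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/pid=\1 comm=\2 anon_rss_kb=\3/p'
}

# Usage: journalctl -k --since yesterday | oom_summary
```

CONSTRAINT_NONE means the host itself ran out; a memcg constraint means a cgroup limit was the wall, which is where the next step picks up.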
Step 6 — in a container, the host numbers are scenery
Everything above assumed the process can use the machine. Inside a container it
cannot: the cgroup memory limit is the real ceiling, and a process can be OOM-killed
with the host showing 40 GB available. free inside the container
happily reports the host's memory (it reads /proc/meminfo,
which is not namespaced), which has confused every engineer at least once. Ask the
cgroup instead:
    $ cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current
    2147483648
    2089247232
    $ grep -E '^(oom_kill|anon) ' /sys/fs/cgroup/memory.events /sys/fs/cgroup/memory.stat
    /sys/fs/cgroup/memory.events:oom_kill 3
    /sys/fs/cgroup/memory.stat:anon 1916221440
Read from inside the container, /sys/fs/cgroup/memory.max is the limit
(2 GiB here; the file says max if there is none) and
memory.current is usage as the kernel charges it — 97% of the way to the
wall. Two subtleties hide in that charge. It includes the page cache for files the
container touches, so a container that reads a lot can sit near its limit while
perfectly healthy, because that cache is reclaimable; memory.stat's
anon line tells you how much is the un-reclaimable kind. And
oom_kill 3 in memory.events means the killer has fired
three times inside this cgroup — kills that never show up as host-level
memory pressure at all.
On Kubernetes the same story wears different clothes. The pod's limit becomes
memory.max, and when a container crosses it the kernel kills the
biggest process inside, the container exits, and the kubelet records the tell:
    $ kubectl get pod api-6d9f7b4c8-x2vlp -o jsonpath='{.status.containerStatuses[0].lastState}'
    {"terminated":{"exitCode":137,"reason":"OOMKilled","startedAt":"2026-06-07T03:01:10Z",
     "finishedAt":"2026-06-07T03:12:44Z"}}
OOMKilled with exit code 137 (128 + SIGKILL's 9) is the canonical
signature. The decision this step forces is a fork: if usage keeps hitting a limit
and the growth is a plateau — the process simply needs more than it was given — the
fix is the limit, and that is a capacity decision, not a bug hunt. If the growth is a
staircase, the limit did its job: it converted a slow leak into a fast, loud,
automatically-restarted failure, and you are back to evidence collection. Either way,
when a containerised service dies of memory, check the cgroup numbers before the host
ones; the host was never the ceiling.
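The distance to the wall is one division away. A minimal sketch, assuming cgroup v2 and the file names used above; cgroup_mem_pct is a hypothetical helper, and the percentage is of charged memory, so it includes reclaimable page cache.

```shell
# cgroup_mem_pct [DIR]: print memory.current as a percentage of memory.max
# for a cgroup v2 directory. DIR defaults to /sys/fs/cgroup, which inside
# a container is the container's own cgroup.
cgroup_mem_pct() {
    dir=${1:-/sys/fs/cgroup}
    max=$(cat "$dir/memory.max") || return 1
    cur=$(cat "$dir/memory.current") || return 1
    if [ "$max" = "max" ]; then
        # the literal string "max" means no limit is configured
        echo "no limit set; current=$cur bytes"
    else
        echo "$(( cur * 100 / max ))% of limit"
    fi
}
```

With the values shown above it prints 97% of limit. Compare that against the anon line in memory.stat before concluding anything: a high percentage made mostly of cache is the healthy case.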
The four endings
Every investigation that starts with "memory is full" ends in one of four places. Knowing the list keeps you from stopping early at a plausible-but-wrong one.
1. The false alarm. Available was fine; the dashboard was plotting used-over-total and counting the page cache as consumption. The fix is the alert expression, not the machine: alert on available (or better, on sustained swap-in and reclaim activity), and write down why so the next on-call does not re-litigate it. This is the most common ending by a wide margin.
2. The too-small limit. The process is healthy — sawtooth, stable
plateau — but the plateau sits above the cgroup limit or too close to the host's
capacity. The evidence is a plateau in the RSS watch plus repeated
oom_kill events at a consistent usage level. The fix is sizing: raise
the limit, shrink the heap flags, or move the workload. Treating this as a leak hunt
wastes days, because there is no leak to find.
3. The actual leak. The staircase. Before you restart anything,
capture the evidence a fix will need, because the evidence dies with the process:
save /proc/PID/smaps_rollup now and again thirty minutes from now, plus
full smaps if you can afford the disk, so whoever debugs it can see
which mappings grew — anonymous heap growth points at the application,
growth in one mapped region points at a specific arena or cache. Note the workload
during the window. Then restart on your schedule rather than the OOM killer's, and
hand the smaps diff to the owning team. A restart is a payment plan, not a payoff.
4. Kernel and friends. Occasionally no process accounts for the
pressure: the per-process numbers sum to far less than "used" minus cache. The memory
is then in the kernel or in shared segments — slab caches (dentries and inodes from a
job that touches millions of files are the classic), tmpfs filling up (it counts as
shared in free), or huge pages reserved and unused. slabtop
ranks the slab caches, cat /proc/meminfo has the supporting lines
(Slab, SReclaimable, Shmem,
HugePages_Total), and df -h /dev/shm /tmp catches tmpfs.
Rare, and worth checking precisely because nobody does. The kernel-side machinery
behind all of these counters is covered in
memory management.
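For the fourth ending, the supporting lines can be gathered in one pass. This is a sketch under assumptions: kernel_side is a name made up here, it reads the /proc/meminfo fields named above, and it additionally uses Hugepagesize (also a real meminfo field) because HugePages_Total is a page count rather than a size.

```shell
# kernel_side [FILE]: print, in MiB, the meminfo counters for memory that
# no process owns: slab caches, shared memory/tmpfs, and reserved huge
# pages (HugePages_Total pages x Hugepagesize kB). FILE defaults to
# /proc/meminfo.
kernel_side() {
    awk '
        $1 == "Slab:" || $1 == "SReclaimable:" || $1 == "Shmem:" {
            printf "%s %d MiB\n", $1, $2 / 1024
        }
        $1 == "HugePages_Total:" { pages = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { printf "HugePages: %d MiB\n", pages * size_kb / 1024 }
    ' "${1:-/proc/meminfo}"
}
```

A Slab figure in the gigabytes with most of it under SReclaimable is the dentry and inode case; a large Shmem figure sends you to df -h /dev/shm /tmp.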
The investigation, end to end
Here is the whole method compressed into one realistic incident, the way it actually reads in a terminal. The page at 03:14: "api-3 memory 96%, api pods restarting."
    $ free -h
                   total        used        free      shared  buff/cache   available
    Mem:            15Gi        13Gi       201Mi        82Mi       1.7Gi       1.2Gi
    Swap:          2.0Gi       1.7Gi       310Mi
    # available is 8% of total and swap is nearly full: real pressure. continue.

    $ journalctl -k --since -2h | grep -ic 'out of memory'
    2
    # the killer has fired twice tonight. the kill lines name java, anon-rss ~13.5G.

    $ ps aux --sort=-%mem | head -3
    USER       PID %CPU %MEM      VSZ     RSS TTY STAT START   TIME COMMAND
    app      30992  9.8 62.1 19077232 9912204 ?   Ssl  03:13  14:02 java -Xmx5g -jar api.jar
    postgres  2204  1.0  9.0  2167040 1438112 ?   Ss   May14  88:10 postgres: checkpointer
    # restarted 90 minutes ago, already at 9.9G RSS with a 5G heap flag.
    # the heap cap is 5G but RSS is double that: the growth is off-heap.

    $ cat /proc/30992/smaps_rollup | grep -E '^(Rss|Pss|Private_Dirty|Anonymous):'
    Rss:             9912204 kB
    Pss:             9874551 kB
    Private_Dirty:   9821080 kB
    Anonymous:       9806332 kB
    # ~all private, ~all anonymous. nothing shared, nothing file-backed. it's heap-like
    # memory the JVM allocated outside the Java heap (native buffers, most likely).

    $ while sleep 60; do ps -o rss= -p 30992; done
    9912204
    9968112
    10024836
    10081172
    10138996
    # ~55MB/minute, monotonic, load is flat. staircase. it's a leak.
Verdict in five commands: a native-memory leak in the api service, roughly
55 MB a minute, off-heap so the -Xmx heap flag will not cap it, killed by the kernel
twice already tonight. Evidence captured: two smaps_rollup snapshots thirty
minutes apart attached to the ticket. Mitigation: scheduled restart every two hours
until the fix lands. The process was at 9.9 GB ninety minutes after its restart, so at
55 MB a minute it reaches the ~13.5 GB where the kernel killed it about an hour
later; a two-hour cycle keeps the staircase from reaching the ceiling. Total time,
maybe twelve minutes, and none of it spent staring at a percent gauge.
free -h (is available actually low?) →
journalctl -k | grep -i 'out of memory' (has the verdict already been
delivered?) → ps aux --sort=-%mem | head (who?) →
cat /proc/PID/smaps_rollup (how much is really theirs?) →
cat /sys/fs/cgroup/memory.max inside the container (what is the real
ceiling?). Five commands, and you exit at the first one more often than not.
What to write in the incident notes
Memory incidents recur, and the second investigation is only faster than the first if
the first wrote things down. The numbers worth recording, because they all evaporate:
the free -h output at the worst moment (available and swap especially);
the top of the ps ranking with actual RSS values, not screenshots of a
dashboard; the full OOM kill line from the kernel log, verbatim, since
anon-rss and oom_score_adj in that one line settle half the
follow-up questions; the cgroup limit and memory.events counters if a
container was involved; and at least two smaps_rollup snapshots with
timestamps if you suspected a leak.
Then write the shape, in words: "RSS grew ~55 MB/min under flat load, floor rising after each GC" is a sentence an engineer can act on six weeks later; "memory was at 96%" is not. And record the ending you reached — false alarm, under-provisioned, leak, or kernel-side — with the one piece of evidence that proved it. The most useful incident note is the one that lets the next person skip to step 4 because steps 1 through 3 were already settled in writing.
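The two timestamped smaps_rollup snapshots are most useful as a diff. A minimal sketch, assuming both files were saved with something like cp /proc/PID/smaps_rollup rollup.$(date +%H%M); rollup_diff is a hypothetical helper, not an existing tool.

```shell
# rollup_diff OLD NEW: for every counter present in both smaps_rollup
# snapshots, print the change in kB. Unchanged counters are skipped.
rollup_diff() {
    awk '
        $NF != "kB" { next }                 # skip the address-range header
        NR == FNR   { old[$1] = $2; next }   # first file: remember values
        ($1 in old) && $2 != old[$1] {
            printf "%s %+d kB\n", $1, $2 - old[$1]
        }
    ' "$1" "$2"
}

# Usage: rollup_diff rollup.0330 rollup.0400
```

Rss and Anonymous growing by the same amount while the shared counters stay flat is the private, anonymous growth described above, and the output pastes into the notes as is.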
Further reading
- proc_pid_smaps(5) — the definitions of Rss, Pss, and the private/shared split, straight from the source.
- cgroup v2 — memory controller — memory.max, memory.current, memory.events, and exactly when the in-cgroup OOM killer fires.
- linuxatemyram.com — the canonical one-page rebuttal to "Linux ate my RAM", good enough to paste into an incident channel.
- Semicolony — free & vmstat — the full decoder for the triage step this investigation begins with.