What's eating my memory?

The dashboard says 94% used. Someone has already typed "memory leak" in the incident channel. Maybe. Or maybe the kernel is doing exactly what it should and the only thing leaking is confidence in the graph. Memory complaints are the most over-reported and under-diagnosed symptom in Linux operations, because the obvious numbers — "used", RSS, the percent gauge — all lie in well-documented ways. This page walks the investigation step by step: each stop is one command, the output you will actually see, and the decision that output forces. By the end you have one of four named causes instead of a vibe.

Step 1 — is it actually full?

Every memory investigation starts with the same triage question, and most of them end there too: is the machine under memory pressure, or is the page cache doing its job? Linux treats free RAM as wasted RAM. Any page not needed by a process gets used to cache file contents, because a cached read is thousands of times cheaper than a disk read, and the kernel can hand those pages back the moment a process asks for them. A healthy server runs near 100% "used" by design. So the first command is free -h, and the discipline is to ignore the column everyone stares at:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       9.8Gi       442Mi       210Mi       5.4Gi       5.1Gi
Swap:          2.0Gi       128Mi       1.9Gi

Read it right to left. available is the kernel's own estimate of how much memory could be handed to a new allocation without swapping: free pages plus the part of the cache and other reclaimable memory it can drop on demand. That is the number that decides whether this investigation continues. buff/cache is the page cache plus kernel buffers — the 5.4 GiB here is not "eaten", it is on loan. free at 442 MiB looks alarming and means almost nothing; on a long-running box, free is always small because the kernel put every idle page to work. The dashboard that paged you is probably plotting used / total, which counts the loan as spent.

The decision this forces: if available is comfortable — say a third of total, as here — there is no memory problem, and you should be asking why the alert fired, not what is eating RAM. If available is genuinely small (a few percent of total) or swap usage is climbing, the pressure is real and you move to step 2. The full anatomy of this output, including what vmstat adds when you want to watch pressure move rather than glance at it, is on free & vmstat. One quick corroborating signal worth knowing: vmstat 1 with sustained non-zero si/so columns means the kernel is actively swapping, and that is pressure no matter what any gauge says.
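The triage in this step reduces to one ratio, so it can be scripted. A minimal sketch, assuming the MemTotal and MemAvailable lines of /proc/meminfo (the file free itself reads); the function name available_pct is an illustrative choice, and whatever threshold you compare the result against is yours to pick:

```shell
#!/usr/bin/env bash
# available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal, to one decimal place.
available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "no MemTotal line found" > "/dev/stderr"; exit 1 }
      printf "%.1f\n", 100 * avail / total
    }'
}

# On a live host:
#   available_pct < /proc/meminfo
```

For the free -h output above (5.1 GiB available out of 15 GiB) the result lands at roughly a third, which is the "no memory problem" exit.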

Figure: The whole investigation as a tree. Most reports exit at the first branch; the ones that do not deserve every step below.

Step 2 — who is holding it?

Real pressure confirmed, so now the question becomes attribution. The blunt instrument is good enough to start: rank every process by resident set size, the amount of physical RAM each one currently has mapped.

$ ps aux --sort=-%mem | head -6
USER       PID %CPU %MEM      VSZ     RSS TTY  STAT START   TIME COMMAND
app      31415 12.4 41.2 19075184 6571236 ?    Ssl  Jun02 412:11 java -Xmx5g -jar api.jar
postgres  2204  1.1  9.0  2167040 1438112 ?    Ss   May14  88:02 postgres: checkpointer
app      31488  0.8  3.1  1204416  498332 ?    Sl   Jun02  21:40 node worker.js
app      31489  0.7  3.1  1199288  496104 ?    Sl   Jun02  20:55 node worker.js
app      31490  0.7  3.0  1198472  494817 ?    Sl   Jun02  21:02 node worker.js

The same view, live: run top, press M (capital), and the table re-sorts by memory and stays sorted while you watch. Two columns matter and one is a trap. RSS is resident physical memory in kilobytes — real pages in real RAM right now. VSZ is virtual address space, and it is the trap: it counts every mapping the process has ever asked for, including reserved but untouched regions, mapped files, and arenas the allocator grabbed speculatively. A JVM with a 19 GB VSZ on a 15 GB machine is normal; a 19 GB RSS would be impossible. Sort by RSS, reason about RSS, and treat VSZ as trivia unless you are debugging address-space exhaustion specifically.

But RSS has its own lie, and on this box it is sitting in rows three through five. Those node workers each report ~490 MB resident, so the table suggests they cost 1.5 GB together. They do not. RSS counts every page resident in the process's address space, including pages shared with other processes: the node binary, libc, every shared library, and any memory the workers inherited from a common parent and have not written to since the fork. Each shared page gets counted once per process that maps it. Sum the RSS column on a machine running a prefork server with thirty workers and you can "account for" more RAM than the machine has. The decision this step forces: if one private-heap process dominates (the JVM here, at 6.6 GB resident against a 5 GB heap flag, which is interesting in itself), RSS is honest enough and you can move on. If the suspects are a fleet of sibling workers, you need the per-process truth, which is the next step.
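To see the double counting for yourself, group the RSS column by command name. This is a rough sketch: rss_by_command is a hypothetical helper that expects ps -eo rss=,comm= style input on stdin, and the totals it prints are deliberately the naive ones, with shared pages counted once per process.

```shell
#!/usr/bin/env bash
# rss_by_command: read "RSS_KB COMMAND" lines on stdin and print
# "TOTAL_RSS_KB PROCESS_COUNT COMMAND", largest naive total first.
# These totals overstate the real cost of any group that shares pages.
rss_by_command() {
  awk '
    NF >= 2 {
      rss = $1
      $1 = ""; sub(/^ +/, "")        # what remains of the line is the command name
      sum[$0] += rss
      cnt[$0]++
    }
    END { for (c in sum) printf "%d %d %s\n", sum[c], cnt[c], c }
  ' | sort -k1,1nr
}

# On a live host:
#   ps -eo rss=,comm= | rss_by_command | head
```

Fed the sample table above, the three node workers come out as a single 1489253 kB line. That is the inflated figure the next step corrects.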

Step 3 — RSS, PSS, USS: the per-process truth

The kernel keeps the honest ledger in /proc/PID/smaps, which lists every memory mapping the process has with a dozen counters each. Reading it raw is for masochists; the kernel pre-sums it for you in smaps_rollup:

$ cat /proc/31488/smaps_rollup
55d8e2a00000-7ffec1b9d000 ---p 00000000 00:00 0    [rollup]
Rss:              498332 kB
Pss:              361409 kB
Shared_Clean:     189204 kB
Shared_Dirty:      12080 kB
Private_Clean:      8112 kB
Private_Dirty:    288936 kB
Referenced:       441208 kB
Anonymous:        297048 kB
Swap:                  0 kB
SwapPss:               0 kB

Three numbers, three different questions answered. RSS (498 MB) is what ps showed: every resident page, shared ones counted in full. PSS, proportional set size (361 MB), splits each shared page's cost evenly among the processes mapping it — a page shared by four processes charges each one a quarter. PSS columns are additive: sum PSS across all processes and you get a true total that cannot exceed physical RAM, which makes it the right number for "how much do these thirty workers cost together?" USS, unique set size, is Private_Clean + Private_Dirty (297 MB here): pages this process alone holds, the amount of RAM that would be freed if you killed it. USS is the right number for "what do I get back if this process goes away?" and it is also the most honest leak metric, because leaked allocations are private by nature.

On this box the arithmetic settles the worker question: three workers at ~490 MB RSS look like 1.5 GB, but at ~300 MB USS each plus one shared copy of the runtime (about 200 MB), the real bill is closer to 1.1 GB. Not the problem. The JVM with 6.6 GB of almost entirely private, dirty memory still is. The general rule: RSS to rank suspects fast, PSS to budget a fleet, USS to size the refund. The /proc filesystem these files live in gets a full tour at /proc, and the machinery behind these counters (pages, mappings, copy-on-write after fork) is the subject of virtual memory.
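The USS and fleet-PSS arithmetic is easy to script against smaps_rollup. A small sketch, assuming the field names shown in the rollup output above; uss_kb and sum_field_kb are hypothetical helper names, and reading another user's smaps_rollup will usually need root:

```shell
#!/usr/bin/env bash
# uss_kb: read one smaps_rollup on stdin, print Private_Clean + Private_Dirty in kB.
uss_kb() {
  awk '$1 == "Private_Clean:" || $1 == "Private_Dirty:" { uss += $2 }
       END { print uss + 0 }'
}

# sum_field_kb FIELD: read one or more concatenated smaps_rollup files on stdin
# and print the sum of the named field in kB (for example Pss, to budget a fleet).
sum_field_kb() {
  awk -v key="$1:" '$1 == key { total += $2 } END { print total + 0 }'
}

# On a live host (PIDs taken from the ps output earlier on this page):
#   uss_kb < /proc/31488/smaps_rollup
#   cat /proc/31488/smaps_rollup /proc/31489/smaps_rollup /proc/31490/smaps_rollup | sum_field_kb Pss
```

Run against the rollup shown above, uss_kb prints 297048, the same 297 MB derived by hand.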

The decision this step forces: you now have one named suspect with a real, private footprint. The next question is not "how big is it" but "which way is it moving."

Step 4 — leak or plateau? Watch the shape

A single RSS reading cannot distinguish a leak from a process that is merely large. Leaks are a verb, not a noun: memory that grows without bound under steady load. So you sample over time. No tooling required beyond a loop:

$ while sleep 60; do echo "$(date +%H:%M) $(ps -o rss= -p 31415)"; done
14:02 6571236
14:03 6588412
14:04 6601780
14:05 6615924
14:06 6633108
…ten minutes pass…
14:16 6772040

Then read the shape, not the values. Processes with managed heaps (JVM, Go, Node, Python) breathe: RSS climbs as garbage accumulates, drops when a collection runs, climbs again. Plotted, that is a sawtooth, and a sawtooth oscillating around a flat plateau is a healthy process, however large the plateau is. A leak looks different: a staircase. Each step up is permanent; collections still run, but the floor after each one is higher than the floor before, because the leaked objects are still referenced (or were allocated outside the managed heap entirely) and nothing can reclaim them. The staircase climbs until it meets a ceiling, and the ceiling decides how the story ends — swap death on a bare host, an OOM kill under a cgroup limit.

Figure: The two shapes that matter. Sample every minute for thirty minutes under comparable load before declaring which one you are looking at.

Two cautions before you call it. First, sample under comparable load: RSS climbing during a traffic ramp is not a leak, it is a process doing its job, and many runtimes never return freed heap to the OS even after load drops, so a high-water mark that holds steady is a plateau too. Second, give it time. A sawtooth with a long GC period can look like a staircase over ten minutes; over an hour it shows its teeth. The decision: a plateau larger than you can afford is a sizing or tuning conversation. A staircase is a leak, and your job shifts from diagnosis to evidence collection — covered in the endings below.
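If you would rather have numbers than squint at a column, the "is the floor rising?" question can be approximated crudely. The sketch below is a heuristic, not a leak detector: rss_trend is a hypothetical helper that compares the lowest sample in the first half of a run with the lowest in the second half, on the assumption that samples were taken at a fixed interval under comparable load.

```shell
#!/usr/bin/env bash
# rss_trend: read RSS samples on stdin, one per line. The value is taken from
# the last field, so "HH:MM rss" lines from the sampling loop work as well.
# Lines whose last field is not a plain number are skipped.
rss_trend() {
  awk '
    NF && $NF ~ /^[0-9]+$/ { v[++n] = $NF + 0 }
    END {
      if (n < 4) { print "need at least 4 samples" > "/dev/stderr"; exit 1 }
      half = int(n / 2)
      lo1 = v[1]
      for (i = 2; i <= half; i++) if (v[i] < lo1) lo1 = v[i]
      lo2 = v[half + 1]
      for (i = half + 2; i <= n; i++) if (v[i] < lo2) lo2 = v[i]
      printf "samples=%d floor_first_half=%d floor_second_half=%d avg_delta_per_sample=%.0f\n",
             n, lo1, lo2, (v[n] - v[1]) / (n - 1)
    }'
}

# Usage: save the sampling loop's output to a file, then
#   rss_trend < rss.log
```

A second-half floor well above the first-half floor, under flat load, is what the staircase predicts; equal floors point at a sawtooth on a plateau. Both cautions above still apply: a short window can fool this exactly as it fools the eye.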

Step 5 — has the OOM killer already been here?

Sometimes you arrive after the verdict. If memory ran out before you got there, the kernel chose a process and killed it, and it wrote down exactly what it did and why. Always check the kernel log; a service that "mysteriously restarted" last night is very often an OOM kill nobody read.

$ journalctl -k --since yesterday | grep -iB1 -A4 'out of memory'
Jun 07 03:12:44 api-3 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
        cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/api.service,
        task=java,pid=29217,uid=1001
Jun 07 03:12:44 api-3 kernel: Out of memory: Killed process 29217 (java)
        total-vm:19075184kB, anon-rss:13727336kB, file-rss:2048kB, shmem-rss:0kB,
        UID:1001 pgtables:28412kB oom_score_adj:0

(On a box without journald, dmesg -T | grep -i oom reads the same ring buffer.) Decode the kill line and you get a small autopsy. Killed process 29217 (java) names the victim. anon-rss:13727336kB is the part that matters: 13.7 GB of anonymous memory (heap and stacks, private to the process, not backed by any file), which on a 15 GB machine is the whole story. total-vm is virtual size, big and boring as usual. file-rss near zero says this was not about mapped files. Above the kill line the kernel also dumps a table of every candidate process with its rss, swap entries, and oom_score_adj; the victim is normally the process with the highest badness score, and that score is essentially "fraction of allowed memory this process is using," nudged by oom_score_adj.

That adjustment knob runs from -1000 to +1000 and is worth knowing from both sides. A value of -1000 makes a process untouchable (sshd often ships with protection like this, so the box stays reachable after a kill); positive values volunteer a process to die first. Two operational notes. If your critical service keeps being the victim, resist the urge to set -1000 and walk away: the memory is still gone, and the killer will simply take the second-place process, which is usually something you wanted alive too. And note which process was killed versus which was guilty: the killer targets the biggest eligible consumer at the moment of crisis, which is usually but not always the process whose growth caused the crisis. A slow leak in a small daemon can push the kernel to execute your perfectly innocent database. The full guide to reading the kernel's logs is at journalctl & dmesg.

The decision this forces: a kill record turns the investigation from "is there a problem" into "name the grower." Take the victim's identity with suspicion, find what was growing before the kill (your metrics history, or the RSS watch from step 4 on the restarted process), and check the next step, because the constraint line in that log tells you whether the wall was the host or a cgroup.
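When there are several kills to compare, the autopsy fields can be pulled out of the log mechanically. A sketch, assuming each kill record arrives as one physical log line in the "Killed process PID (name) ... anon-rss:NkB" layout shown above (the transcript on this page is wrapped for display); oom_victims is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# oom_victims: read kernel log text on stdin and print "PID NAME ANON_RSS_KB"
# for every OOM kill record found. ANON_RSS_KB is "?" if the field is absent.
oom_victims() {
  awk '
    match($0, /Killed process [0-9]+ \([^)]*\)/) {
      s = substr($0, RSTART + 15, RLENGTH - 15)   # e.g. "29217 (java)"
      pid = s;  sub(/ .*/, "", pid)
      name = s; sub(/^[0-9]+ \(/, "", name); sub(/\)$/, "", name)
      anon = "?"
      if (match($0, /anon-rss:[0-9]+kB/))
        anon = substr($0, RSTART + 9, RLENGTH - 11)
      print pid, name, anon
    }'
}

# On a live host:
#   journalctl -k --since yesterday | oom_victims
```

One line per kill makes it easy to see whether the same service dies at the same size each time, which matters for the sizing-versus-leak question later on.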

Step 6 — in a container, the host numbers are scenery

Everything above assumed the process can use the machine. Inside a container it cannot: the cgroup memory limit is the real ceiling, and a process can be OOM-killed with the host showing 40 GB available. free inside the container happily reports the host's memory (it reads /proc/meminfo, which is not namespaced), which has confused every engineer at least once. Ask the cgroup instead:

$ cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current
2147483648
2089247232
$ grep -E '^(oom_kill|anon) ' /sys/fs/cgroup/memory.events /sys/fs/cgroup/memory.stat
/sys/fs/cgroup/memory.events:oom_kill 3
/sys/fs/cgroup/memory.stat:anon 1916221440

Read from inside the container, /sys/fs/cgroup/memory.max is the limit (2 GiB here; the file says max if there is none) and memory.current is usage as the kernel charges it: 97% of the way to the wall. These are the cgroup v2 file names; a host still on cgroup v1 exposes the same two values as memory.limit_in_bytes and memory.usage_in_bytes under /sys/fs/cgroup/memory/. Two subtleties hide in that charge. It includes the page cache for files the container touches, so a container that reads a lot can sit near its limit while perfectly healthy, because that cache is reclaimable; memory.stat's anon line tells you how much is the kind that cannot simply be dropped. And oom_kill 3 in memory.events means the killer has fired three times inside this cgroup, kills that never show up as host-level memory pressure at all.
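The "how close to the wall" figure is a single division. A minimal sketch, assuming cgroup v2 file names and the container's own view of /sys/fs/cgroup; cgroup_usage_pct is a hypothetical helper, and the optional directory argument exists only so it can be pointed at another cgroup:

```shell
#!/usr/bin/env bash
# cgroup_usage_pct [DIR]: print memory.current as a percentage of memory.max
# for the cgroup at DIR (default /sys/fs/cgroup), or "unlimited" if no limit is set.
cgroup_usage_pct() {
  local dir=${1:-/sys/fs/cgroup} max cur
  max=$(cat "$dir/memory.max") || return 1
  cur=$(cat "$dir/memory.current") || return 1
  if [ "$max" = "max" ]; then
    echo "unlimited"
    return 0
  fi
  awk -v cur="$cur" -v max="$max" 'BEGIN { printf "%.1f\n", 100 * cur / max }'
}

# Inside the container:
#   cgroup_usage_pct
```

With the two values shown above it prints 97.3. Remember that this charge includes reclaimable page cache, so read it next to the anon line rather than on its own.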

On Kubernetes the same story wears different clothes. The pod's limit becomes memory.max, and when a container crosses it the kernel kills the biggest process inside, the container exits, and the kubelet records the tell:

$ kubectl get pod api-6d9f7b4c8-x2vlp -o jsonpath='{.status.containerStatuses[0].lastState}'
{"terminated":{"exitCode":137,"reason":"OOMKilled","startedAt":"2026-06-07T03:01:10Z",
"finishedAt":"2026-06-07T03:12:44Z"}}

OOMKilled with exit code 137 (128 + SIGKILL's 9) is the canonical signature. The decision this step forces is a fork: if usage keeps hitting a limit and the growth is a plateau — the process simply needs more than it was given — the fix is the limit, and that is a capacity decision, not a bug hunt. If the growth is a staircase, the limit did its job: it converted a slow leak into a fast, loud, automatically-restarted failure, and you are back to evidence collection. Either way, when a containerised service dies of memory, check the cgroup numbers before the host ones; the host was never the ceiling.
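The 128 + signal arithmetic works for any exit code, not just 137. A tiny sketch using the shell's own kill -l to name the signal; signal_from_exit is a hypothetical helper, and it assumes a code above 128 really came from a signal rather than from a program that chose that number itself:

```shell
#!/usr/bin/env bash
# signal_from_exit CODE: for exit codes above 128, print the name of the
# signal that the shell convention (128 + signal number) implies.
signal_from_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    kill -l "$((code - 128))"
  else
    echo "exit status $code, not a signal death"
  fi
}

# signal_from_exit 137    # 137 - 128 = 9
```

KILL on its own does not prove an OOM kill, since anything can send SIGKILL; the reason field in the pod status or the kernel log is what confirms it.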

The four endings

Every investigation that starts with "memory is full" ends in one of four places. Knowing the list keeps you from stopping early at a plausible-but-wrong one.

1. The false alarm. Available was fine; the dashboard was plotting used-over-total and counting the page cache as consumption. The fix is the alert expression, not the machine: alert on available (or better, on sustained swap-in and reclaim activity), and write down why so the next on-call does not re-litigate it. This is the most common ending by a wide margin.

2. The too-small limit. The process is healthy — sawtooth, stable plateau — but the plateau sits above the cgroup limit or too close to the host's capacity. The evidence is a plateau in the RSS watch plus repeated oom_kill events at a consistent usage level. The fix is sizing: raise the limit, shrink the heap flags, or move the workload. Treating this as a leak hunt wastes days, because there is no leak to find.

3. The actual leak. The staircase. Before you restart anything, capture the evidence a fix will need, because the evidence dies with the process: save /proc/PID/smaps_rollup now and again thirty minutes from now, plus full smaps if you can afford the disk, so whoever debugs it can see which mappings grew — anonymous heap growth points at the application, growth in one mapped region points at a specific arena or cache. Note the workload during the window. Then restart on your schedule rather than the OOM killer's, and hand the smaps diff to the owning team. A restart is a payment plan, not a payoff.

4. Kernel and friends. Occasionally no process accounts for the pressure: the per-process numbers sum to far less than "used" minus cache. The memory is then in the kernel or in shared segments — slab caches (dentries and inodes from a job that touches millions of files are the classic), tmpfs filling up (it counts as shared in free), or huge pages reserved and unused. slabtop ranks the slab caches, cat /proc/meminfo has the supporting lines (Slab, SReclaimable, Shmem, HugePages_Total), and df -h /dev/shm /tmp catches tmpfs. Rare, and worth checking precisely because nobody does. The kernel-side machinery behind all of these counters is covered in memory management.
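For the fourth ending, the supporting lines in /proc/meminfo can be gathered into one summary. A sketch, assuming the standard meminfo field names; HugePages_Free and Hugepagesize are not mentioned above but are needed to turn a page count into kilobytes, and kernel_side is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# kernel_side: read /proc/meminfo-style text on stdin and print the
# non-process consumers in kB: slab, reclaimable slab, shmem/tmpfs,
# and huge pages (reserved and still unused).
kernel_side() {
  awk '
    $1 == "Slab:"            { slab = $2 }
    $1 == "SReclaimable:"    { srec = $2 }
    $1 == "Shmem:"           { shmem = $2 }
    $1 == "HugePages_Total:" { hp_total = $2 }
    $1 == "HugePages_Free:"  { hp_free = $2 }
    $1 == "Hugepagesize:"    { hp_kb = $2 }
    END {
      printf "slab_kb=%d reclaimable_slab_kb=%d shmem_kb=%d hugepages_reserved_kb=%d hugepages_unused_kb=%d\n",
             slab, srec, shmem, hp_total * hp_kb, hp_free * hp_kb
    }'
}

# On a live host:
#   kernel_side < /proc/meminfo
```

If these numbers are small next to the gap you are trying to explain, the kernel is not the culprit either and the per-process accounting deserves a second look.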

The investigation, end to end

Here is the whole method compressed into one realistic incident, the way it actually reads in a terminal. The page at 03:14: "api-3 memory 96%, api pods restarting."

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi        13Gi       201Mi        82Mi       1.7Gi       1.2Gi
Swap:          2.0Gi       1.7Gi       310Mi
# available is 8% of total and swap is nearly full — real pressure. continue.

$ journalctl -k --since -2h | grep -ic 'out of memory'
2
# the killer has fired twice tonight. the kill lines name java, anon-rss ~13.5G.

$ ps aux --sort=-%mem | head -3
USER       PID %CPU %MEM      VSZ      RSS TTY STAT START  TIME COMMAND
app      30992  9.8 62.1 19077232  9912204 ?   Ssl  01:44 14:02 java -Xmx5g -jar api.jar
postgres  2204  1.0  9.0  2167040  1438112 ?   Ss   May14 88:10 postgres: checkpointer
# restarted 90 minutes ago, already at 9.9G RSS with a 5G heap flag.
# the heap cap is 5G but RSS is double that: the growth is off-heap.

$ grep -E '^(Rss|Pss|Private_Dirty|Anonymous):' /proc/30992/smaps_rollup
Rss:             9912204 kB
Pss:             9874551 kB
Private_Dirty:   9821080 kB
Anonymous:       9806332 kB
# ~all private, ~all anonymous. nothing shared, nothing file-backed. it's heap-like
# memory the JVM allocated outside the Java heap (native buffers, most likely).

$ while sleep 60; do ps -o rss= -p 30992; done
9912204   9968112   10024836   10081172   10138996
# ~55MB/minute, monotonic, load is flat. staircase. it's a leak.

Verdict in five commands: a native-memory leak in the api service, roughly 55 MB a minute, off-heap so no JVM flag will cap it, killed by the kernel twice already tonight. Evidence captured: two smaps_rollup snapshots thirty minutes apart attached to the ticket. Mitigation: scheduled restart every two hours until the fix lands, which keeps the staircase from reaching the ceiling (at 55 MB a minute the process gains about 3.3 GB an hour, so it climbs from 9.9 GB to the ~13.5 GB kill point in roughly an hour more, about two and a half hours after a restart). Total time, maybe twelve minutes, and none of it spent staring at a percent gauge.
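The two snapshots are most useful as a diff. A sketch of one way to produce it, assuming each snapshot is a saved copy of /proc/PID/smaps_rollup; rollup_diff and the file names in the usage comment are hypothetical:

```shell
#!/usr/bin/env bash
# rollup_diff EARLIER LATER: compare two saved smaps_rollup snapshots and
# print every field whose value changed, with the signed delta in kB.
rollup_diff() {
  awk '
    $3 != "kB"          { next }                  # skip the address-range header line
    FILENAME == ARGV[1] { before[$1] = $2; next } # remember the earlier values
    {
      d = $2 - before[$1]
      if (d != 0) printf "%-16s %+d kB\n", $1, d
    }
  ' "$1" "$2"
}

# Usage (file names are illustrative):
#   cp /proc/30992/smaps_rollup rollup-0320.txt
#   ...thirty minutes later...
#   cp /proc/30992/smaps_rollup rollup-0350.txt
#   rollup_diff rollup-0320.txt rollup-0350.txt
```

Growth that shows up almost entirely in Anonymous and Private_Dirty, with the shared and file-backed fields flat, is the pattern this incident's verdict rests on.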

The fast version. When you have five minutes, not fifty: free -h (is available actually low?) → journalctl -k | grep -i 'out of memory' (has the verdict already been delivered?) → ps aux --sort=-%mem | head (who?) → cat /proc/PID/smaps_rollup (how much is really theirs?) → cat /sys/fs/cgroup/memory.max inside the container (what is the real ceiling?). Five commands, and you exit at the first one more often than not.

What to write in the incident notes

Memory incidents recur, and the second investigation is only faster than the first if the first wrote things down. The numbers worth recording, because they all evaporate: the free -h output at the worst moment (available and swap especially); the top of the ps ranking with actual RSS values, not screenshots of a dashboard; the full OOM kill line from the kernel log, verbatim, since anon-rss and oom_score_adj in that one line settle half the follow-up questions; the cgroup limit and memory.events counters if a container was involved; and at least two smaps_rollup snapshots with timestamps if you suspected a leak.

Then write the shape, in words: "RSS grew ~55 MB/min under flat load, floor rising after each GC" is a sentence an engineer can act on six weeks later; "memory was at 96%" is not. And record the ending you reached — false alarm, under-provisioned, leak, or kernel-side — with the one piece of evidence that proved it. The most useful incident note is the one that lets the next person skip to step 4 because steps 1 through 3 were already settled in writing.
