OOMKilled

The kernel’s OOM killer shot your container for exceeding its cgroup memory limit. The node can have gigabytes free and this still happens — the limit is the whole world as far as the cgroup is concerned.


The symptom

A container terminates with Reason: OOMKilled and exit code 137 (128 + 9, SIGKILL). On the node, the kernel log has a matching entry naming the process it killed and the cgroup whose limit was hit.

$ kubectl describe pod worker-6f7d8b9c4-tq2nn
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137          ← 128 + 9: destroyed by SIGKILL

$ dmesg -T | grep -i -A1 "out of memory"      # on the node
[Tue Jun  9 11:02:17] Memory cgroup out of memory: Killed process 31744 (worker)
                      total-vm:3182040kB, anon-rss:2092912kB, file-rss:1024kB
  ← "Memory cgroup out of memory" = the container’s limit, not the node’s RAM
  ← anon-rss ≈ 2.0Gi against a 2Gi limit: the working set simply didn’t fit
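The arithmetic in those annotations can be checked mechanically. The sketch below is illustrative only: describe_exit_code and kib_to_gib are hypothetical helper names, and it assumes a Unix-like system where Python's signal module knows SIGKILL as signal 9.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a rough cause."""
    if code > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited on its own with status {code}"


def kib_to_gib(kib: int) -> float:
    """The kernel's 'kB' figures are units of 1024 bytes."""
    return kib / (1024 * 1024)


print(describe_exit_code(137))           # killed by SIGKILL
print(f"{kib_to_gib(2092912):.2f} Gi")   # 2.00 Gi
```

The second line is the anon-rss figure from the kernel log above, which lands within a few MiB of the 2Gi limit.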

The diagnosis

1. Confirm it was the cgroup limit, not node pressure

$ dmesg -T | grep -iE "memory cgroup|out of memory" | tail -3
[11:02:17] Memory cgroup out of memory: Killed process 31744 (worker) ...
  ← "Memory cgroup out of memory"  → the container blew its own limit
  ← "Out of memory: Killed process" → the NODE ran out: different problem

$ kubectl get pod worker-6f7d8b9c4-tq2nn -o jsonpath="{.status.containerStatuses[0].lastState}"
map[terminated:map[exitCode:137 reason:OOMKilled ...]]

Two different mechanisms of death. A cgroup OOM means your container exceeded its own memory limit, so the fix is in your pod spec or your process. Node-level memory pressure instead triggers kubelet eviction (the pod shows status Evicted, with no exit code 137) or, if pressure builds faster than the kubelet can react, a node-wide OOM kill. If you see Evicted pods rather than OOMKilled, you’re in eviction territory: a different mechanism with different knobs.
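The two kernel log signatures can be told apart with a substring check, as long as the more specific message is tested first. This is a minimal sketch: classify_oom_line is a hypothetical helper, and it assumes the kernel wording matches the lines shown in step 1, which can differ between kernel versions.

```python
def classify_oom_line(line: str) -> str:
    """Classify one kernel log line as a cgroup OOM, a node OOM, or neither."""
    text = line.lower()
    # Check the cgroup message first: it also contains "out of memory".
    if "memory cgroup out of memory" in text:
        return "cgroup-limit"
    if "out of memory" in text:
        return "node-oom"
    return "not-oom"


sample = "[Tue Jun  9 11:02:17] Memory cgroup out of memory: Killed process 31744 (worker)"
print(classify_oom_line(sample))  # cgroup-limit
```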

2. Compare actual usage to the limit

$ kubectl get pod worker-6f7d8b9c4-tq2nn -o jsonpath="{.spec.containers[0].resources}"
{"limits":{"memory":"2Gi"},"requests":{"memory":"1Gi"}}

$ kubectl top pod worker-6f7d8b9c4-tq2nn --containers
POD                       NAME     CPU(cores)   MEMORY(bytes)
worker-6f7d8b9c4-tq2nn    worker   210m         1937Mi
  ← cruising at 1.9Gi against a 2Gi limit: any burst is fatal

The shape of usage over time is the diagnosis. Steady state parked just under the limit: the limit is simply too small for the real working set (cause 1). A monotonic ramp that climbs regardless of traffic: a leak (cause 3). A flat line with sudden steps at traffic spikes: burst allocations (cause 4). Pull the memory graph from your metrics stack before changing anything — the one number kubectl top shows is an instant, not a story.
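As a rough illustration of reading the curve, the sketch below sorts a series of evenly spaced memory samples into the three shapes described above. Everything in it is an assumption for demonstration: classify_usage is a hypothetical helper, the thresholds (90% of the limit, a 15% single-step jump, 80% rising intervals) are arbitrary, and real data is noisier than this heuristic allows for.

```python
from statistics import median


def classify_usage(samples_mib, limit_mib, near=0.9, step=0.15):
    """Guess the usage shape from evenly spaced samples (MiB)."""
    deltas = [b - a for a, b in zip(samples_mib, samples_mib[1:])]
    # Burst: a single-interval jump bigger than `step` of the limit.
    if any(d > step * limit_mib for d in deltas):
        return "burst"
    # Ramp: usage grows in nearly every interval.
    rising = sum(1 for d in deltas if d > 0)
    if deltas and rising >= 0.8 * len(deltas):
        return "ramp"
    # Plateau: typical usage sits close to the limit.
    if median(samples_mib) >= near * limit_mib:
        return "plateau"
    return "comfortable"


print(classify_usage([1900, 1910, 1895, 1920, 1905, 1900], 2048))  # plateau
print(classify_usage([600, 700, 810, 900, 1010, 1100], 2048))      # ramp
print(classify_usage([800, 810, 805, 1500, 820, 815], 2048))       # burst
```

The labels map onto causes 1, 3, and 4 respectively; a human looking at the actual graph will still do better than fixed thresholds.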

3. Check what the runtime thinks its budget is

$ kubectl exec worker-6f7d8b9c4-tq2nn -- sh -c "cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current"
2147483648      ← the limit the kernel will enforce (2Gi)
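2079346688      ← current usage, page cache included
  ← these are cgroup v2 paths; on cgroup v1 read memory/memory.limit_in_bytes and memory/memory.usage_in_bytes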

$ kubectl exec worker-6f7d8b9c4-tq2nn -- java -XX:+PrintFlagsFinal -version | grep MaxHeapSize
   size_t MaxHeapSize = 1073741824    ← JVM sized heap at 1Gi: fine
  (a heap sized off the NODE’s RAM instead would dwarf the 2Gi limit)

Managed runtimes size themselves at startup, and if they size off the wrong number they are doomed before the first request. Modern JVMs read the cgroup limit (container support is on by default; steer with -XX:MaxRAMPercentage). The Go runtime does not read the cgroup limit and applies no cap of its own unless you set GOMEMLIMIT, which is a soft limit. Node.js needs an explicit --max-old-space-size (in MiB). If heap + off-heap + runtime overhead adds up to more than the limit, the kill is a scheduled event.
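The budget check in that last sentence is plain subtraction. A minimal sketch, using the 2Gi limit and 1Gi heap from the output above; the off-heap and overhead figures are invented placeholders, and memory_headroom is a hypothetical helper name.

```python
MIB = 1024 * 1024


def memory_headroom(limit_bytes, heap_bytes, off_heap_bytes, overhead_bytes):
    """Bytes left under the cgroup limit; negative means it cannot fit."""
    return limit_bytes - (heap_bytes + off_heap_bytes + overhead_bytes)


limit = 2147483648   # memory.max from the output above (2Gi)
heap = 1073741824    # MaxHeapSize from the output above (1Gi)
off_heap = 512 * MIB   # assumed: direct buffers, metaspace (placeholder)
overhead = 256 * MIB   # assumed: thread stacks, runtime itself (placeholder)

headroom = memory_headroom(limit, heap, off_heap, overhead)
print(f"headroom: {headroom // MIB} MiB")  # headroom: 256 MiB
```

Substitute measured values for the placeholders; if the result is negative, the process is over budget before it serves a request.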

The causes, ranked

  1. The limit was a guess, and the real working set is bigger

    Confirm: usage plateaus just under the limit and the kill fires during normal operation, not during spikes.

  2. The runtime sized its heap off the node, not the cgroup

    Confirm: heap configuration (e.g. JVM MaxHeapSize, Node old-space) alone approaches or exceeds the container limit; kills happen at the same heap fill level every time.

  3. A memory leak

    Confirm: usage ramps monotonically regardless of load and resets to baseline on every restart; the time between restarts tells you how fast it leaks.

  4. Burst allocation outruns reclaim

    Confirm: kills line up with traffic spikes or specific request shapes (big uploads, exports, fan-out reads); steady-state usage is comfortable.

The fixes

The limit was a guess, and the real working set is bigger

Measure peak memory over a representative day, set the limit 30–50% above it, and revisit after the next traffic change. If the pod must not be the first to die under node pressure, set requests equal to limits for both CPU and memory on every container. That puts the pod in the Guaranteed QoS class, the last to be evicted.
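The sizing rule is simple enough to write down. In this sketch suggest_limit_mib is a hypothetical helper, rounding up to a multiple of 64Mi is an arbitrary tidiness choice, and 1937Mi is only the instantaneous reading from kubectl top earlier, standing in for a properly measured peak.

```python
import math


def suggest_limit_mib(peak_mib, headroom=0.3, round_to=64):
    """Peak usage plus fractional headroom, rounded up to a multiple of round_to."""
    raw = peak_mib * (1 + headroom)
    return math.ceil(raw / round_to) * round_to


print(suggest_limit_mib(1937, headroom=0.3))  # 2560
print(suggest_limit_mib(1937, headroom=0.5))  # 2944
```

With that stand-in peak, the 30–50% band works out to a limit somewhere between 2560Mi and 2944Mi.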

The runtime sized its heap off the node, not the cgroup

JVM: rely on container support and set -XX:MaxRAMPercentage; 50–75% is the usual band, which leaves room for metaspace, thread stacks, and direct buffers. Go: set GOMEMLIMIT a little under the cgroup limit. Node: set --max-old-space-size (in MiB) below the limit. Rule of thumb: heap is not the whole process.
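One way to keep these settings in step with the limit is to derive them from it. The sketch below is an illustration under assumptions: runtime_settings is a hypothetical helper, and the 75% heap share and 90% GOMEMLIMIT share are example values, not recommendations from any runtime's documentation.

```python
def runtime_settings(limit_bytes, heap_fraction=0.75, go_fraction=0.9):
    """Derive per-runtime memory settings from a cgroup limit in bytes."""
    limit_mib = limit_bytes // (1024 * 1024)
    return {
        # JVM: percentage of the container limit to use for the heap.
        "jvm": f"-XX:MaxRAMPercentage={heap_fraction * 100:.1f}",
        # Go: soft memory limit; the MiB suffix is accepted by the runtime.
        "go": f"GOMEMLIMIT={int(limit_mib * go_fraction)}MiB",
        # Node.js: old-space heap size, expressed in MiB.
        "node": f"--max-old-space-size={int(limit_mib * heap_fraction)}",
    }


for runtime, setting in runtime_settings(2147483648).items():
    print(f"{runtime}: {setting}")
```

For the 2Gi limit used throughout, this yields a 75% JVM heap, GOMEMLIMIT=1843MiB, and a 1536 MiB Node old space.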

A memory leak

Profile it: pprof heap profiles for Go, heap dumps for the JVM, tracemalloc for Python. Raising the limit only reschedules the kill. If you need breathing room while you hunt, raise the limit and add an alert on slope, not on level.
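Alerting on slope means fitting a line to recent samples and asking when it crosses the limit. A minimal sketch, assuming samples arrive as (hours, MiB) pairs; both function names are hypothetical and the sample data is made up.

```python
def slope_mib_per_hour(samples):
    """Least-squares slope of (hours, MiB) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def hours_until_limit(samples, limit_mib):
    """Projected hours until the latest sample reaches the limit, or None."""
    slope = slope_mib_per_hour(samples)
    if slope <= 0:
        return None  # flat or shrinking: no projected kill
    return (limit_mib - samples[-1][1]) / slope


leaky = [(0, 1000), (1, 1050), (2, 1100), (3, 1150)]  # made-up samples
print(slope_mib_per_hour(leaky))        # 50.0
print(hours_until_limit(leaky, 2048))   # about 18 hours
```

A level-based alert on this data would stay silent until the last hour or two; the slope says the kill is coming from the first few samples.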

Burst allocation outruns reclaim

Stream instead of buffering whole payloads, cap request body sizes and per-request concurrency, and bound queue depths. The kernel reclaims page cache under pressure, but a fast anonymous-memory spike can outrun reclaim and trigger the kill anyway.
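Streaming with a size cap can be sketched in a few lines. This is an illustration, not any particular framework's API: copy_capped is a hypothetical helper that works on any file-like objects, and the 64 KiB chunk size is an arbitrary choice.

```python
import io


def copy_capped(src, dst, max_bytes, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk, rejecting payloads over max_bytes."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)
        if total > max_bytes:
            # Stop before writing the chunk that crosses the cap.
            raise ValueError("payload exceeds cap")
        dst.write(chunk)


upload = io.BytesIO(b"x" * 200_000)   # stand-in for a request body
sink = io.BytesIO()                   # stand-in for a file or socket
print(copy_capped(upload, sink, max_bytes=1_000_000))  # 200000
```

The copy loop itself never holds more than one chunk of the payload, whatever its total size, so a large upload no longer translates into a large anonymous-memory spike.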

What people get wrong

  • The OOM killer shoots a process, not "the container". It picks the highest-scoring process in the cgroup. If that’s a child worker rather than PID 1, your main process may survive with a dead child and limp along half-broken — sometimes without the pod ever showing OOMKilled. When a pod is acting weird under memory pressure, check dmesg on the node even if Kubernetes says nothing.
  • "But the node had plenty of free memory". Irrelevant by design. The cgroup limit is a hard ceiling for the container regardless of what the node has spare. That isolation is the point — one tenant’s burst can’t eat the box.
  • Exit 137 alone doesn’t prove OOM. 137 means SIGKILL, and the OOM killer is only its most famous sender. Stop-timeout escalations and external kills produce the same code. Confirm with the OOMKilled reason or the dmesg line before buying RAM — the exit-code-137 page walks the other senders.
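The last point, checking the reason before trusting the exit code, can be expressed as a small decision function. A sketch under assumptions: diagnose_termination is a hypothetical helper, and it expects the container's lastState as JSON (for example, extracted from kubectl get pod -o json); the sample documents below are made up.

```python
import json


def diagnose_termination(last_state_json: str) -> str:
    """Separate a confirmed OOM kill from a SIGKILL of unknown origin."""
    state = json.loads(last_state_json)
    terminated = state.get("terminated") or {}
    if terminated.get("reason") == "OOMKilled":
        return "oom"
    if terminated.get("exitCode") == 137:
        # SIGKILL, but not attributed to the OOM killer: check other senders.
        return "sigkill-from-another-sender"
    return "not-sigkill"


print(diagnose_termination('{"terminated": {"exitCode": 137, "reason": "OOMKilled"}}'))
print(diagnose_termination('{"terminated": {"exitCode": 137, "reason": "Error"}}'))
```

Because of the child-process case in the first bullet, a result other than "oom" still does not rule out an OOM kill inside the container; the node's dmesg remains the final word.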

Quick answers

Why was my pod OOMKilled when the node had free memory?

Because the kill came from the container’s own cgroup memory limit, not from the node. The kernel enforces the limit per-container: once the container’s usage hits memory.max, the OOM killer fires inside that cgroup no matter how much RAM the node has spare.

How do I find out what was using the memory?

Start with the dmesg line on the node — it names the killed process and its anon-rss. Then look at the usage curve in your metrics: a plateau near the limit means undersizing, a steady ramp means a leak (profile with pprof or heap dumps), and spikes that match traffic mean burst allocations.

Should memory requests equal limits?

For anything you can’t afford to lose under node pressure, yes. Setting requests == limits (for CPU as well as memory, on every container) gives the pod Guaranteed QoS, which the kubelet evicts last. The cost is scheduling density: you reserve the full limit even when idle. For batch and best-effort workloads, a gap is fine.
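The QoS rules are mechanical enough to check in code. The sketch below is a simplified model: qos_class is a hypothetical helper, each container is represented as a dict shaped like its resources block, and quantities are compared as raw strings, so "1" and "1000m" would wrongly count as different.

```python
def qos_class(containers):
    """Approximate a pod's QoS class from its containers' resources blocks."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The worker pod's spec from step 2: memory request below the limit, no CPU.
print(qos_class([{"limits": {"memory": "2Gi"}, "requests": {"memory": "1Gi"}}]))  # Burstable
```

Matching only the memory values is not enough: without CPU requests and limits that also match, the pod stays Burstable.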
