/proc — the kernel's front door

Every diagnostic tool you have met on this track — top, ps, free, lsof — is a pretty-printer. None of them has a private line to the kernel. They all read the same place you can read with cat: a fake filesystem mounted at /proc, where every file is rendered fresh, on demand, the instant you open it. Learn five files under /proc/PID/ and you can answer most of what those tools answer with no tools installed at all — and a few things none of them will tell you.

Where the numbers actually come from

Run strace against ps sometime and watch what it does. It does not make a mysterious syscall that returns a process table. It opens /proc/1/stat, reads it, closes it, opens /proc/2/stat, and so on for every numbered directory it finds. free opens /proc/meminfo and does arithmetic. uptime opens /proc/loadavg and /proc/uptime. lsof walks /proc/*/fd. The whole observability toolbox is a set of opinions about how to format the contents of one directory.

That directory is strange the moment you look closely. ls -l /proc/meminfo reports a size of zero bytes, yet cat produces fifty lines. The modification times are always now. There is no disk behind it, no blocks, no journal. /proc is a filesystem in interface only: a tree of names the kernel agrees to answer questions through. When you open() and read() one of these files, the kernel runs a function that gathers the answer at that instant, formats it as text, and hands it to you. Nothing was stored; everything was computed because you asked. Close the file, read it again, and the kernel computes a fresh answer.
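The zero-size oddity takes two commands to see. This is a minimal sketch, assuming a stat that accepts the -c format option (GNU coreutils and BusyBox do; it is not POSIX):

```shell
# What the filesystem claims the file holds, versus what a read delivers.
reported=$(stat -c %s /proc/meminfo)   # size field from the inode
delivered=$(wc -c < /proc/meminfo)     # bytes the kernel rendered for this read
echo "stat reports $reported bytes; one read delivered $delivered bytes"
```

Run it twice and the delivered count can differ slightly between runs, because each read is a fresh rendering.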

This design has a consequence worth internalising early: there is no privileged tool. When top and a shell script disagree about a process's memory, the tiebreaker is to read the file both of them derive from. When a box is so broken that nothing is installed — a minimal container image, a rescue shell, an initramfs — cat and ls against /proc recover most of what the missing tools would have told you. The rest of this page is the map: which files to read, how to decode the two that confuse everyone, and where the floor creaks.

The five files that matter

Each running process gets a directory named after its PID, and inside it the kernel publishes several dozen entries. Most are for specialists. Five of them carry nearly all the diagnostic weight, and they are worth knowing the way you know your own phone number.

Entry	What it holds	When you read it
`status`	The human-readable summary: state, parent PID, UIDs, `VmRSS`, `Threads`, `FDSize`, signal masks	First stop for any "what is this process doing" question
`fd/`	One symlink per open descriptor, pointing at the file, socket, or pipe behind it	Descriptor leaks, "what does this hold open" — this is lsof's raw source
`limits`	The resource limits of the running process: open files, address space, core size	Whenever you suspect a ulimit — your shell's `ulimit -n` is not evidence
`smaps_rollup`	True memory accounting: RSS, PSS, private and shared, summed across all mappings	When "how much memory does it really use" has to survive shared pages
`cmdline`, `environ`, `cwd`, `exe`	Exact argv, the environment at exec time, the working directory, and a link to the binary	"What is this thing, really" — identity questions, imposter processes

Alongside the per-process directories sit the system-wide files. /proc/meminfo is the machine's memory ledger, the one free summarises. /proc/loadavg is the three load averages plus the running/total task count. /proc/stat holds the per-CPU tick counters that top turns into percentages. /proc/net/tcp and its siblings are the socket tables, and /proc/cpuinfo describes the processors. You rarely read these raw when the tools are present, but knowing they exist changes what "the tool is missing" means: it means nothing.
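As a small illustration of how little stands between you and these files, here is a sketch that splits /proc/loadavg with nothing but shell builtins. The variable names are illustrative. The fifth field, the most recently assigned PID, is not mentioned above but is part of the same line.

```shell
# /proc/loadavg is a single line: three averages, running/total tasks, newest PID.
read -r load1 load5 load15 tasks last_pid < /proc/loadavg
running=${tasks%/*}    # text before the slash: currently runnable tasks
total=${tasks#*/}      # text after the slash: all tasks the scheduler knows
echo "load $load1 $load5 $load15; $running of $total tasks runnable; newest PID $last_pid"
```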

The /proc/PID directory, abridged to the entries you will actually open. Everything here is rendered when read; none of it occupies disk.

Reading /proc/PID/status

status is the file to cat first, every time. It is deliberately human-formatted — one Key: value pair per line — and a dozen of its lines carry real diagnostic signal. Here is an excerpt from a Java service, decoded.

$ cat /proc/41327/status
Name:   java                      # the comm name, truncated to 15 chars
State:  S (sleeping)              # R running, S sleeping, D uninterruptible, Z zombie
Tgid:   41327
Pid:    41327
PPid:   1290                      # the parent — who started this?
Uid:    1003  1003  1003  1003    # real, effective, saved, fs
FDSize: 256                       # fd table slots allocated, not fds in use
VmPeak: 12482040 kB               # largest virtual size ever reached
VmSize: 12433016 kB               # current virtual size — mostly meaningless
VmRSS:   1873204 kB               # pages actually resident in RAM right now
VmSwap:        0 kB               # pages of this process pushed to swap
Threads: 47                       # thread count — each one lives under task/
voluntary_ctxt_switches:    981244
nonvoluntary_ctxt_switches:  41873 # high and climbing = fighting for CPU

A few of these repay a closer look. State is the same letter ps prints: the one to fear is D, uninterruptible sleep, which usually means the process is stuck inside the kernel waiting on I/O and will ignore every signal you send it. PPid answers "who started this" — a service whose parent is PID 1 was either started by init or orphaned and adopted. VmSize is virtual address space, and on modern allocators it is gloriously useless as a memory number: a JVM or a Go runtime will reserve tens of gigabytes it never touches. VmRSS is the number that corresponds to physical RAM, and it is what top shows in the RES column. FDSize trips people up: it is the allocated size of the descriptor table, a power of two, not the count of open descriptors — for the count, run ls /proc/41327/fd | wc -l. And Threads tells you this single PID is actually 47 schedulable threads, which matters in a minute.
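The lines discussed above can be pulled out in one pass. This is a rough sketch; proc_summary is a hypothetical helper name, and it assumes the target is a userspace process (kernel threads have no VmRSS line).

```shell
# Print a few status fields plus the real open-descriptor count.
# Usage: proc_summary [PID]   (defaults to the current shell)
proc_summary() {
    pid=${1:-$$}
    # Split each "Key:<whitespace>value" line on the colon and following blanks.
    awk -F':[ \t]*' '
        $1 ~ /^(Name|State|PPid|VmRSS|Threads|FDSize)$/ { print $1 "=" $2 }
    ' "/proc/$pid/status"
    # FDSize is table capacity; the fd/ listing is the actual count.
    echo "open_fds=$(ls "/proc/$pid/fd" | wc -l)"
}

proc_summary "$$"
```

Seeing FDSize and open_fds side by side makes the distinction between allocated slots and descriptors in use hard to forget.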

Identity questions go to the neighbours. cmdline holds the exact argv the process was started with, so you can see the real flags rather than what the runbook claims. cwd is a symlink to the live working directory — the answer to "why can I not unmount this volume" is often a shell sitting in it. exe is a symlink to the binary itself, with a property so useful it gets its own scenario below. Be aware that cmdline and environ delimit their entries with NUL bytes, not newlines, so a bare cat smashes everything into one unreadable line. Pipe through tr '\0' '\n' and they turn legible.
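A sketch of those identity reads bundled together; who_is is a hypothetical helper name. It assumes you own the target process or have the privilege to read its entries.

```shell
# Answer "what is this thing, really" from three /proc entries.
# Usage: who_is [PID]   (defaults to the current shell)
who_is() {
    pid=${1:-$$}
    printf 'argv: '
    tr '\0' ' ' < "/proc/$pid/cmdline"   # NUL separators become spaces
    echo
    echo "cwd:  $(readlink "/proc/$pid/cwd")"
    echo "exe:  $(readlink "/proc/$pid/exe")"
}

who_is "$$"
```

Joining argv with spaces is a display convenience only; an argument that itself contains a space becomes ambiguous, so use tr '\0' '\n' when exact boundaries matter.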

RSS, PSS, USS — the memory page nobody decodes

"How much memory does this process use" sounds like one question. It is three, and the reason is shared pages. When two processes map the same library, the kernel keeps one physical copy and points both page tables at it — that is most of the point of virtual memory. But it wrecks naive accounting. If each process reports the shared pages as its own, summing the per-process numbers counts those pages twice, and the total comes out larger than the RAM in the machine. Linux resolves this by giving you three numbers with three different rules about shared pages.

RSS charges every resident page to every process that maps it, in full. It is cheap to track and honest about one process in isolation, but RSS values must never be summed. PSS, proportional set size, splits each shared page evenly among its mappers: a page shared by ten processes adds one tenth of a page to each one's PSS. PSS values sum correctly: add the PSS of every process and you get the true total of RAM held by process mappings (kernel memory and unmapped page cache sit outside all three numbers). USS, unique set size, counts only the pages no one else maps: the memory that would actually be returned to the system if this process exited right now. The kernel does not print the name USS, but it is sitting in smaps_rollup as the two Private lines, Private_Clean and Private_Dirty, added together.

$ cat /proc/41327/smaps_rollup
00400000-7ffc8f0c4000 ---p 00000000 00:00 0      [rollup]
Rss:             1873204 kB   # everything resident, shared counted in full
Pss:             1641008 kB   # shared pages divided by their mapper count
Shared_Clean:     244612 kB   # shared, unmodified — mostly library code
Shared_Dirty:        512 kB   # shared and written to
Private_Clean:     12440 kB   # ours alone, unmodified
Private_Dirty:   1615640 kB   # ours alone, written — the heap lives here
Swap:                  0 kB
SwapPss:               0 kB

Read it bottom-up. Private_Clean plus Private_Dirty is the USS: about 1.63 GB that exists only for this process. The Shared lines are the roughly 245 MB of pages this process has in common with others — almost all of it clean library text. RSS is private plus shared in full, 1.87 GB. PSS lands in between, because the shared 245 MB gets divided by however many processes map each page; here it contributes only about 13 MB, which tells you those libraries are shared widely. The spread between the three numbers is itself information: a fleet of worker processes forked from one parent can show a huge combined RSS while the PSS total stays modest, because they are all reading the same copy-on-write pages.
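The USS arithmetic is a one-line awk job. A minimal sketch, with uss_kb as a hypothetical helper name. It takes a file path rather than a PID so it can be checked against the excerpt above before being pointed at a live process.

```shell
# Sum Private_Clean and Private_Dirty (kB) from a smaps_rollup-style file.
uss_kb() {
    awk '/^Private_(Clean|Dirty):/ { sum += $2 } END { print sum + 0 }' "$1"
}

# Check against the two Private lines from the excerpt above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Private_Clean:     12440 kB
Private_Dirty:   1615640 kB
EOF
uss_kb "$sample"        # 12440 + 1615640 = 1628080
rm -f "$sample"

# Live use, if this kernel provides smaps_rollup:
if [ -r "/proc/$$/smaps_rollup" ]; then
    echo "this shell's USS: $(uss_kb "/proc/$$/smaps_rollup") kB"
fi
```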

Two processes, 600 MB private each, sharing one 200 MB library. Three accounting rules, three answers — and only the PSS column survives addition.
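The arithmetic behind that two-process picture, worked in shell so you can change the inputs. The variable names are illustrative.

```shell
private=600   # MB private to each process
shared=200    # MB of library mapped by both
n=2           # number of processes sharing it

rss=$((private + shared))          # shared pages charged in full
pss=$((private + shared / n))      # shared pages split among mappers
uss=$private                       # shared pages not charged at all
physical=$((n * private + shared)) # what the RAM actually holds

echo "RSS: $rss each, $((n * rss)) summed"    # 800 each, 1600 summed
echo "PSS: $pss each, $((n * pss)) summed"    # 700 each, 1400 summed
echo "USS: $uss each, $((n * uss)) summed"    # 600 each, 1200 summed
echo "RAM actually occupied: $physical"       # 1400
```

Only the PSS sum equals the RAM actually occupied; RSS overshoots by the shared 200 MB and USS undershoots by the same amount.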

A practical rule of thumb: use RSS to watch one process over time, PSS when you need to add processes together (capacity planning, "which team's services fill this box"), and USS when deciding what killing a process would actually buy you. smaps_rollup exists because the older path — reading /proc/PID/smaps, which prints these counters for every individual mapping and runs to thousands of lines — was painfully slow to parse. The rollup makes the per-process totals one cheap-ish read. Cheap-ish, not free; the pitfalls section explains why.

Three production scenarios

"Too many open files" — but whose limits?

A service starts failing with EMFILE. The first reflex is to check the limit, so someone SSHes in, runs ulimit -n, sees 65536, and declares the limits fine. That check proved nothing. ulimit reports the limits of your login shell, which inherited them from sshd and PAM. The failing service was started by systemd, or a container runtime, or an init script from 2019, and inherited a completely different set. Limits are per-process, inherited from whatever started the process (and adjustable afterwards with setrlimit or prlimit), and the only authoritative record is the one the kernel keeps for the process itself.

$ ulimit -n
65536                              # your shell. irrelevant.
$ grep "open files" /proc/41327/limits
Max open files            1024      4096      files
$ ls /proc/41327/fd | wc -l
1019                               # five away from the soft limit

Two reads and the diagnosis is complete: the process runs with a soft limit of 1024, it is holding 1019 descriptors, and the crash is minutes away. Whether those 1019 descriptors are legitimate or a leak is the next question — ls -l /proc/41327/fd shows what each one points at, and the lsof page walks that investigation. The lesson generalises: any time the question is "what constraints does this process run under," read its own limits, environ, and cgroup files instead of guessing from a shell that shares nothing with it but a hostname.
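Those two reads fold naturally into one helper. This is a sketch under the assumption that the limits file keeps the column layout shown above; soft_nofile and fd_headroom are hypothetical names.

```shell
# Soft "Max open files" limit from a limits-style file (fourth column of that row).
soft_nofile() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Report descriptors in use against the soft limit of a running process.
# Usage: fd_headroom [PID]   (defaults to the current shell)
fd_headroom() {
    pid=${1:-$$}
    soft=$(soft_nofile "/proc/$pid/limits")
    used=$(ls "/proc/$pid/fd" | wc -l)
    echo "pid $pid: $used of $soft descriptors in use, $((soft - used)) to spare"
}

fd_headroom "$$"
```

Pointed at the failing service instead of your shell, the last number is the one that tells you how close the next EMFILE is.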

The binary was deleted, the process lives on

A deploy replaced /usr/local/bin/worker, but the old process is still running — and now it is misbehaving in a way the new binary supposedly fixed, and someone wants to disassemble exactly what is executing. Or worse: the binary was deleted by an attacker covering tracks, and the only copy left in the world is the one the kernel is executing. Either way, the file has no name on disk, and either way it does not matter, because deleting a file only removes its name; the inode survives while anything holds it — and a running program holds its own text. /proc/PID/exe is a live handle to that inode.

$ ls -l /proc/8841/exe
lrwxrwxrwx 1 root root 0 Jun  8 10:42 /proc/8841/exe -> /usr/local/bin/worker (deleted)
$ sudo cp /proc/8841/exe /tmp/worker.rescued
$ sha256sum /tmp/worker.rescued
9f86d081884c7d65...  /tmp/worker.rescued

The (deleted) marker confirms the name is gone, but cp through the symlink reads the inode's contents byte for byte. You now have the exact binary that is running, suitable for hashing against your artifact registry, diffing against the new release, or handing to forensics. The same trick recovers a deleted shared library through /proc/PID/map_files/, and it is the reason "delete the malware" is not the same as "stop the malware." The inode-and-link-count machinery this rides on is covered in file systems.
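You can rehearse the rescue safely before you need it. The sketch below uses a private copy of sleep as a stand-in for the deleted binary. It assumes the temporary directory allows execution and that sleep accepts fractional seconds (GNU and BusyBox do).

```shell
# Run a throwaway copy of sleep, delete its name, then recover it via /proc.
workdir=$(mktemp -d)
original=$(command -v sleep)
cp "$original" "$workdir/sleep"
"$workdir/sleep" 30 &
pid=$!

# Wait (at most about 5 s) until the child has exec'd the copy.
tries=0
while [ "$(readlink "/proc/$pid/exe" 2>/dev/null)" != "$workdir/sleep" ] && [ "$tries" -lt 50 ]; do
    sleep 0.1
    tries=$((tries + 1))
done

rm "$workdir/sleep"                             # the name is gone, the inode is not
link=$(readlink "/proc/$pid/exe")               # path now carries the (deleted) marker
cp "/proc/$pid/exe" "$workdir/sleep.rescued"    # read the inode through the handle

if cmp -s "$original" "$workdir/sleep.rescued"; then
    rescued=identical
else
    rescued=different
fi
echo "$link: rescued copy is $rescued to the original"

kill "$pid"
wait "$pid" 2>/dev/null || true
rm -r "$workdir"
```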

The process is at 400% CPU — which thread?

top shows one process pinned at 400% CPU. A process is not a unit of execution, though; its threads are, and this one has 47 of them. The kernel publishes each thread as a subdirectory of /proc/PID/task/, named by thread ID, with the same files inside — its own stat, its own status, its own comm. Most runtimes name their threads, so the busy ones often identify themselves.

$ ls /proc/41327/task | head -4
41327
41342
41358
41389
$ for t in /proc/41327/task/*; do
    echo "$(cut -d' ' -f1 "$t/schedstat")  $(cat "$t/comm")"
  done | sort -nr | head -4
7841220047113  GC Thread#0
7790881520446  GC Thread#1
 412904118821  http-worker-3
   1204481950  main

The first field of a thread's schedstat file is its cumulative on-CPU time in nanoseconds (fields 14 and 15 of stat carry the same story in user and kernel ticks); sample it twice a few seconds apart and the deltas tell you which threads are burning. In this excerpt the garbage collector threads dwarf everything else — a memory problem wearing a CPU costume. In practice you would let top -H -p 41327 do the sampling and sorting for you, and the top & htop page covers that view; the point here is that top -H is not doing anything you could not do with a shell loop over task/. When the tool's numbers look wrong, the files are how you check them.
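The "sample twice and take deltas" step can be sketched as two small functions. The names snapshot_threads and cpu_deltas are hypothetical, and the sketch assumes the kernel exposes per-thread schedstat files. Deltas are printed in milliseconds so that every awk implementation formats them as plain integers.

```shell
# snapshot_threads PID: one line per thread, "TID on_cpu_ns name".
snapshot_threads() {
    for t in /proc/"$1"/task/*; do
        [ -r "$t/schedstat" ] || continue   # thread exited, or schedstat unavailable
        echo "${t##*/} $(cut -d' ' -f1 "$t/schedstat") $(cat "$t/comm")"
    done
}

# cpu_deltas BEFORE AFTER: CPU milliseconds each thread used between snapshots.
cpu_deltas() {
    awk '
        NR == FNR { before[$1] = $2; next }      # first file: remember baselines
        $1 in before {
            ms = ($2 - before[$1]) / 1000000     # nanoseconds to milliseconds
            name = $0
            sub(/^[^ ]+ [^ ]+ /, "", name)       # keep names that contain spaces
            printf "%.0f %s %s\n", ms, $1, name
        }
    ' "$1" "$2" | sort -nr
}

pid=$$                       # substitute the PID you are chasing
before=$(mktemp)
after=$(mktemp)
snapshot_threads "$pid" > "$before"
sleep 2
snapshot_threads "$pid" > "$after"
cpu_deltas "$before" "$after" | head -4
rm -f "$before" "$after"
```

Threads that start between the two snapshots are skipped, which is acceptable for a quick look; a longer-running sampler would want to treat a missing baseline as zero.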

What /proc actually is

It helps to see /proc for what it is underneath: an interface, dressed up as a filesystem because Unix already had superb plumbing for files. The kernel keeps its real bookkeeping in internal structures — the task list, the per-process descriptor tables and memory maps described in processes, the page tables described in virtual memory. Procfs registers a tree of names, and behind each name a handler function. open() on /proc/41327/status does not find data; it finds the handler. read() invokes it, the handler walks the live task structure for PID 41327, formats what it finds as text, and that text is your file contents. The file is the conversation, not a thing that exists between conversations.

This explains every oddity at once. Sizes show as zero because there is nothing to measure until a read happens. Contents change between reads because each read recomputes. Listing /proc is enumerating the task list, which is why directories appear and vanish as processes start and exit. Permissions are real and enforced — most of another user's /proc/PID internals are unreadable without privilege, which is exactly why an unprivileged lsof sees so little. And reads have genuine cost, because "render this answer" can mean real kernel work: a read of smaps_rollup walks the process's page tables to count page references. The filesystem metaphor is so good that it is easy to forget you are calling into the kernel every time, but you are.

The same trick appears elsewhere once you know to look. /sys exposes devices and kernel parameters through the same render-on-read approach, one value per file. /proc/sys/ goes further and accepts writes: as root, echo a number into /proc/sys/vm/swappiness and you have changed a kernel tunable with no tool but the shell. The design lesson is the one Unix keeps teaching: if you expose state as files, every existing tool (cat, grep, watch, a shell loop) becomes a client for free.

Pitfalls

Summing PSS across a busy box is expensive. Every read of smaps_rollup makes the kernel walk that process's page tables, and a loop over two thousand processes is two thousand walks. It will finish, but on a loaded machine it can take seconds of CPU and visibly perturb the thing you are measuring. Do it when you need the true total; do not put it in a one-second metrics loop. (And never parse full smaps when smaps_rollup exists — same data, hundreds of times more text.)
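For the occasions when you do need the true total, a one-shot sum might look like the sketch below. It only counts processes whose smaps_rollup you are allowed to read, so run it with privilege for a whole-machine figure. The variable names are illustrative.

```shell
# One-shot PSS total across every process we can read. Not for tight loops.
total_kb=0
counted=0
for f in /proc/[0-9]*/smaps_rollup; do
    # A failed read means the process exited or belongs to someone else.
    pss=$(awk '/^Pss:/ { print $2 }' "$f" 2>/dev/null) || continue
    [ -n "$pss" ] || continue        # nothing reported, e.g. no user mappings
    total_kb=$((total_kb + pss))
    counted=$((counted + 1))
done
echo "PSS total: $total_kb kB across $counted readable processes"
```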

environ and cmdline are NUL-delimited, and environ is a snapshot. The missing newlines are a five-second annoyance fixed with tr '\0' '\n'. The deeper trap is temporal: environ shows the environment as it was at exec time. If the process called setenv() afterwards, or its runtime mutates the environment, those changes will not be reflected. It answers "what was this started with," not "what does it see now."
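The snapshot behaviour is easy to demonstrate with your own shell. In this sketch PROC_DEMO_VAR is a made-up variable name, exported after the shell started; the shell's own environ should not show it, while a freshly exec'd child's should.

```shell
export PROC_DEMO_VAR=set_after_start   # hypothetical variable, set after exec

# This shell's environ is frozen at its exec time, so the variable is absent.
in_self=$(tr '\0' '\n' < "/proc/$$/environ" | grep -c '^PROC_DEMO_VAR=' || true)

# A child is exec'd with the updated environment, so its own environ has it.
in_child=$(sh -c 'tr "\0" "\n" < /proc/$$/environ | grep -c "^PROC_DEMO_VAR="' || true)

echo "own environ: $in_self match(es); child environ: $in_child match(es)"
unset PROC_DEMO_VAR
```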

PIDs race. Between you reading /proc/41327/status and /proc/41327/fd, the process can exit and the kernel can hand 41327 to a brand-new process. The traditional PID space is small (32768; check /proc/sys/kernel/pid_max, since many current systems raise it) and wraps fast on busy machines. Scripts that walk /proc must tolerate directories vanishing mid-walk, because an ENOENT there is weather, not an error, and anything that needs a stable handle on a process should hold its /proc/PID directory open or use pidfds rather than trusting the number twice.
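A walk that treats vanishing processes as weather might look like this sketch, which tallies processes by state letter. Any read that fails is skipped silently rather than aborting the loop.

```shell
# Tally processes by state, skipping any that exit while we are walking.
states=$(
    for status in /proc/[0-9]*/status; do
        state=$(awk '/^State:/ { print $2; exit }' "$status" 2>/dev/null) || continue
        [ -n "$state" ] || continue
        echo "$state"
    done | sort | uniq -c
)
echo "$states"
```

The tally is only a snapshot assembled over many separate reads, so the counts describe no single instant; that is inherent in walking /proc, not a flaw in the loop.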

In a container, /proc tells you about the host. Procfs reports kernel state, and a container shares the host's kernel. So /proc/meminfo inside a container with a 512 MB cgroup limit cheerfully reports the host's 64 GB, and /proc/loadavg reports the load of every tenant on the machine. Generations of runtimes sized their heaps and thread pools off these files and then died confused when the cgroup limit arrived first. The real limits live under /sys/fs/cgroup/, and some platforms mount lxcfs over the offending files to fake container-local views — which means inside a container you cannot even assume the lie is consistent. PID namespaces add a second twist: /proc inside the container lists only the container's processes, renumbered from 1, so the PID you see inside is not the PID the host sees.
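To see the disagreement for yourself, compare the two sources side by side. This is a sketch assuming the usual mount layout inside a container: memory.max at the top of /sys/fs/cgroup for cgroup v2, or memory/memory.limit_in_bytes for cgroup v1. On a host shell, your own cgroup lives in a subdirectory instead and the helper will report unknown. cgroup_mem_limit is a hypothetical name.

```shell
# Print the memory limit of the cgroup mounted at /sys/fs/cgroup, if visible.
cgroup_mem_limit() {
    if [ -r /sys/fs/cgroup/memory.max ]; then                       # cgroup v2
        cat /sys/fs/cgroup/memory.max
    elif [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then   # cgroup v1
        cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    else
        echo unknown
    fi
}

meminfo_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
echo "/proc/meminfo MemTotal: $meminfo_kb kB"
echo "cgroup memory limit:    $(cgroup_mem_limit) (bytes; 'max' means unlimited)"
```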

Not everything under /proc is read-only. Almost everything in this page is read-only and safe, but /proc has writable corners, such as the /proc/sys/ tunables and /proc/sysrq-trigger, where a stray redirect changes kernel behaviour immediately, no confirmation asked. The habit to build: cat freely, but treat any > aimed inside /proc with the respect you would give a config change in production, because that is what it is.

A drill you can run right now

Everything below is read-only and safe on any Linux machine, shared boxes included. The subject is your own shell, which the variable $$ always names, so no permissions are needed and nothing can be disturbed. Ten minutes, and the five files stop being a list and become places you have been.

Step 1 — meet your shell. Read its summary and pick out the lines you now know:

$ head -20 /proc/$$/status        # output abridged to the lines you know
Name:   bash
State:  S (sleeping)
PPid:   2204                      # trace it: that's sshd or your terminal
VmRSS:      5288 kB
Threads:    1
$ tr '\0' '\n' < /proc/$$/environ | head -5
LANG=en_US.UTF-8
HOME=/home/nilesh
SHELL=/bin/bash

Follow the PPid upward — cat /proc/2204/comm — and keep going until you hit PID 1. You have just walked the process tree by hand, which is all pstree does.
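The same walk, scripted. This is a minimal sketch with ancestry as a hypothetical helper name. It stops when PPid reaches 0, which is what PID 1 reports, and also what you see when the parent lives outside your PID namespace.

```shell
# Print "PID name" for a process and each ancestor, up to PID 1.
# Usage: ancestry [PID]   (defaults to the current shell)
ancestry() {
    pid=${1:-$$}
    while [ "${pid:-0}" -gt 0 ]; do
        echo "$pid $(cat "/proc/$pid/comm" 2>/dev/null)"
        # An unreadable or vanished status file ends the walk.
        pid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status" 2>/dev/null) || pid=0
    done
}

ancestry "$$"
```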

Step 2 — descriptors, limits, and place. Look at what your shell holds open and the rules it runs under:

$ ls -l /proc/$$/fd
lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 0 -> /dev/pts/0
lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 1 -> /dev/pts/0
lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 2 -> /dev/pts/0
$ grep "open files" /proc/$$/limits
Max open files            1024      1048576      files
$ ls -l /proc/$$/cwd /proc/$$/exe
lrwxrwxrwx ... /proc/12871/cwd -> /home/nilesh
lrwxrwxrwx ... /proc/12871/exe -> /usr/bin/bash

Descriptors 0, 1, and 2 all point at your terminal device — standard input, output, and error, visible as plain symlinks. Now run exec 3< /etc/hostname, list fd again, and watch descriptor 3 appear; exec 3<&- closes it and it vanishes. You have watched the kernel's descriptor table change in real time, which is the entire mechanism behind lsof.

Step 3 — catch a tool red-handed. Prove that free is a formatter, not a source:

$ head -3 /proc/meminfo
MemTotal:       16384256 kB
MemFree:         1204480 kB
MemAvailable:    9842116 kB
$ free -k | head -2
              total     used      free   shared  buff/cache  available
Mem:       16384256  6021048  1204480   312044     9158728     9842116

Same numbers, to the kilobyte: total is MemTotal, free is MemFree, available is MemAvailable, and the rest is arithmetic over a few more lines of the same file. If you have strace handy, strace -e openat free shows the openat("/proc/meminfo") call outright — the strace page makes a habit of this kind of unmasking. Finish by reading cat /proc/loadavg and comparing it with uptime's output, and the lesson is complete: the tools are conveniences. The files are the truth.
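If you would rather have the comparison made for you, here is a sketch that checks the first column. It assumes free prints a row beginning with "Mem:" (procps and BusyBox both do under the C locale) and degrades gracefully when free is not installed.

```shell
# Compare MemTotal from the source with the total column free prints.
meminfo_total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)

if command -v free >/dev/null 2>&1; then
    free_total=$(LC_ALL=C free -k | awk '/^Mem:/ { print $2 }')
    if [ "$meminfo_total" = "$free_total" ]; then
        echo "free agrees with /proc/meminfo: $meminfo_total kB"
    else
        echo "mismatch: meminfo $meminfo_total kB, free $free_total kB"
    fi
else
    echo "free is not installed; /proc/meminfo says $meminfo_total kB"
fi
```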

If you remember three paths. /proc/PID/status for what a process is, /proc/PID/fd for what it holds, /proc/PID/limits for the rules it actually runs under. And reach for tr '\0' '\n' whenever the output arrives as one long line.
