/proc — the kernel's front door
Every diagnostic tool you have met on this track — top, ps,
free, lsof — is a pretty-printer. None of them has a private line
to the kernel. They all read the same place you can read with cat: a fake
filesystem mounted at /proc, where every file is rendered fresh, on demand, the
instant you open it. Learn five files under /proc/PID/ and you can answer most
of what those tools answer with no tools installed at all — and a few things none of them
will tell you.
Where the numbers actually come from
Run strace against ps sometime and watch what it does. It does not
make a mysterious syscall that returns a process table. It opens
/proc/1/stat, reads it, closes it, opens /proc/2/stat, and so on for
every numbered directory it finds. free opens /proc/meminfo and does
arithmetic. uptime opens /proc/loadavg and
/proc/uptime. lsof walks
/proc/*/fd. The whole observability toolbox is a set of opinions about how to
format the contents of one directory.
That directory is strange the moment you look closely. ls -l /proc/meminfo reports
a size of zero bytes, yet cat produces fifty lines. The modification times are
always now. There is no disk behind it, no blocks, no journal. /proc is a
filesystem in interface only: a tree of names the kernel agrees to answer questions through.
When you open() and read() one of these files, the kernel runs a
function that gathers the answer at that instant, formats it as text, and hands it to you.
Nothing was stored; everything was computed because you asked. Close the file, read it again,
and the kernel computes a fresh answer.
This design has a consequence worth internalising early: there is no privileged tool. When
top and a shell script disagree about a process's memory, the tiebreaker is to
read the file both of them derive from. When a box is so broken that nothing is installed —
a minimal container image, a rescue shell, an initramfs — cat and ls
against /proc recover most of what the missing tools would have told you. The
rest of this page is the map: which files to read, how to decode the two that confuse
everyone, and where the floor creaks.
The five files that matter
Each running process gets a directory named after its PID, and inside it the kernel publishes several dozen entries. Most are for specialists. Five of them carry nearly all the diagnostic weight, and they are worth knowing the way you know your own phone number.
| Entry | What it holds | When you read it |
|---|---|---|
| status | The human-readable summary: state, parent PID, UIDs, VmRSS, Threads, FDSize, signal masks | First stop for any "what is this process doing" question |
| fd/ | One symlink per open descriptor, pointing at the file, socket, or pipe behind it | Descriptor leaks, "what does this hold open"; this is lsof's raw source |
| limits | The resource limits of the running process: open files, address space, core size | Whenever you suspect a ulimit; your shell's ulimit -n is not evidence |
| smaps_rollup | True memory accounting: RSS, PSS, private and shared, summed across all mappings | When "how much memory does it really use" has to survive shared pages |
| cmdline, environ, cwd, exe | Exact argv, the environment at exec time, the working directory, and a link to the binary | "What is this thing, really": identity questions, imposter processes |
Alongside the per-process directories sit the system-wide files. /proc/meminfo is
the machine's memory ledger, the one free
summarises. /proc/loadavg is the three load averages plus the running/total task
count. /proc/stat holds the per-CPU tick counters that
top turns into percentages.
/proc/net/tcp and its siblings are the socket tables, and
/proc/cpuinfo describes the processors. You rarely read these raw when the tools
are present, but knowing they exist changes what "the tool is missing" means: it means
nothing.
Reading /proc/PID/status
status is the file to cat first, every time. It is deliberately
human-formatted — one Key: value pair per line — and a dozen of its lines carry
real diagnostic signal. Here is an excerpt from a Java service, decoded.
    $ cat /proc/41327/status
    Name:    java                    # the comm name, truncated to 15 chars
    State:   S (sleeping)            # R running, S sleeping, D uninterruptible, Z zombie
    Tgid:    41327
    Pid:     41327
    PPid:    1290                    # the parent: who started this?
    Uid:     1003 1003 1003 1003     # real, effective, saved, fs
    FDSize:  256                     # fd table slots allocated, not fds in use
    VmPeak:  12482040 kB             # largest virtual size ever reached
    VmSize:  12433016 kB             # current virtual size, mostly meaningless
    VmRSS:   1873204 kB              # pages actually resident in RAM right now
    VmSwap:  0 kB                    # pages of this process pushed to swap
    Threads: 47                      # thread count; each one lives under task/
    voluntary_ctxt_switches:    981244
    nonvoluntary_ctxt_switches: 41873    # high and climbing = fighting for CPU
A few of these repay a closer look. State is the same letter
ps prints: the one to fear is D, uninterruptible sleep, which
usually means the process is stuck inside the kernel waiting on I/O and will ignore every
signal you send it. PPid answers "who started this" — a service whose parent is
PID 1 was either started by init or orphaned and adopted. VmSize is virtual
address space, and on modern allocators it is gloriously useless as a memory number: a JVM or
a Go runtime will reserve tens of gigabytes it never touches. VmRSS is the
number that corresponds to physical RAM, and it is what top shows in the RES
column. FDSize trips people up: it is the allocated size of the descriptor
table, a power of two, not the count of open descriptors — for the count, run
ls /proc/41327/fd | wc -l. And Threads tells you this single PID is
actually 47 schedulable threads, which matters in a minute.
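The decoding above is easy to script. What follows is a minimal sketch rather than a standard tool: status_field is a hypothetical helper name, and it assumes only the "Key: value" layout that status already uses.

```shell
# status_field FILE KEY  (hypothetical helper, not a standard tool)
# Print the value of one "Key: value" line from a status-format file.
status_field() {
    awk -v key="$2" -F':[ \t]*' '$1 == key { print $2; exit }' "$1"
}

# Usage: the triage lines for the current shell, if /proc is mounted.
if [ -r "/proc/$$/status" ]; then
    for key in Name State PPid VmRSS Threads; do
        printf '%-8s %s\n' "$key" "$(status_field "/proc/$$/status" "$key")"
    done
fi
```

A key the kernel does not print for a given process (kernel threads have no VmRSS line, for instance) simply comes back empty.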
Identity questions go to the neighbours. cmdline holds the exact argv the
process was started with, so you can see the real flags rather than what the runbook claims.
cwd is a symlink to the live working directory — the answer to "why can I not
unmount this volume" is often a shell sitting in it. exe is a symlink to the
binary itself, with a property so useful it gets its own scenario below. Be aware that
cmdline and environ delimit their entries with NUL bytes, not
newlines, so a bare cat smashes everything into one unreadable line. Pipe
through tr '\0' '\n' and they turn legible.
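As a small illustration of the NUL-delimited layout, the sketch below numbers each entry the way argv is indexed. show_argv is a hypothetical helper name; the only assumption is that no argument itself contains a newline, which would split across two output lines.

```shell
# show_argv FILE  (hypothetical helper)
# Print one NUL-delimited entry per line, numbered from 0 like argv.
# FILE is a cmdline-format file; environ works too, though the label still says argv.
show_argv() {
    tr '\0' '\n' < "$1" | awk '{ printf "argv[%d] = %s\n", NR - 1, $0 }'
}

# Usage: the current shell's own command line, if /proc is mounted.
if [ -r "/proc/$$/cmdline" ]; then
    show_argv "/proc/$$/cmdline"
fi
```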
RSS, PSS, USS — the memory page nobody decodes
"How much memory does this process use" sounds like one question. It is three, and the reason is shared pages. When two processes map the same library, the kernel keeps one physical copy and points both page tables at it — that is most of the point of virtual memory. But it wrecks naive accounting. If each process reports the shared pages as its own, summing the per-process numbers counts those pages twice, and the total comes out larger than the RAM in the machine. Linux resolves this by giving you three numbers with three different rules about shared pages.
RSS charges every resident page to every process that maps it, in full. It is
cheap to track and honest about one process in isolation, but RSS values must never be summed.
PSS, proportional set size, splits each shared page evenly among its mappers:
a page shared by ten processes adds one tenth of a page to each one's PSS. PSS values sum
correctly: add the PSS of every process and you get the true total of RAM those processes occupy, with each page counted exactly once.
USS, unique set size, counts only the pages no one else maps: the memory
that would actually be returned to the system if this process exited right now. The kernel
does not print the name USS, but it is sitting in smaps_rollup as the two
Private lines added together.
    $ cat /proc/41327/smaps_rollup
    00400000-7ffc8f0c4000 ---p 00000000 00:00 0    [rollup]
    Rss:            1873204 kB    # everything resident, shared counted in full
    Pss:            1641008 kB    # shared pages divided by their mapper count
    Shared_Clean:    244612 kB    # shared, unmodified; mostly library code
    Shared_Dirty:       512 kB    # shared and written to
    Private_Clean:    12440 kB    # ours alone, unmodified
    Private_Dirty:  1615640 kB    # ours alone, written; the heap lives here
    Swap:                 0 kB
    SwapPss:              0 kB
Read it bottom-up. Private_Clean plus Private_Dirty is the USS: about 1.63 GB that exists only for this process. The Shared lines are the roughly 245 MB of pages this process has in common with others — almost all of it clean library text. RSS is private plus shared in full, 1.87 GB. PSS lands in between, because the shared 245 MB gets divided by however many processes map each page; here it contributes only about 13 MB, which tells you those libraries are shared widely. The spread between the three numbers is itself information: a fleet of worker processes forked from one parent can show a huge combined RSS while the PSS total stays modest, because they are all reading the same copy-on-write pages.
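The same arithmetic can be done mechanically. This is a sketch under one stated assumption: USS is derived here as Private_Clean plus Private_Dirty, exactly as described above, since the kernel prints no USS line of its own. mem_breakdown is a hypothetical helper name.

```shell
# mem_breakdown FILE  (hypothetical helper)
# Print RSS, PSS and USS in kB from a smaps_rollup-format file.
# USS is not a kernel field; it is derived as Private_Clean + Private_Dirty.
mem_breakdown() {
    awk '
        $1 == "Rss:"           { rss = $2 }
        $1 == "Pss:"           { pss = $2 }
        $1 == "Private_Clean:" { uss += $2 }
        $1 == "Private_Dirty:" { uss += $2 }
        END { printf "rss_kb=%d pss_kb=%d uss_kb=%d\n", rss, pss, uss }
    ' "$1"
}

# Usage: the current shell, on kernels that provide smaps_rollup.
if [ -r "/proc/$$/smaps_rollup" ]; then
    mem_breakdown "/proc/$$/smaps_rollup"
fi
```

Fed the Java excerpt above, it reports uss_kb=1628080, which is the 1.63 GB figure: 12440 kB of clean private pages plus 1615640 kB of dirty ones.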
A practical rule of thumb: use RSS to watch one process over time, PSS when you need to add
processes together (capacity planning, "which team's services fill this box"), and USS when
deciding what killing a process would actually buy you. smaps_rollup exists
because the older path — reading /proc/PID/smaps, which prints these counters
for every individual mapping and runs to thousands of lines — was painfully slow to parse.
The rollup makes the per-process totals one cheap-ish read. Cheap-ish, not free; the pitfalls
section explains why.
Three production scenarios
"Too many open files" — but whose limits?
A service starts failing with EMFILE. The first reflex is to check the limit, so
someone SSHes in, runs ulimit -n, sees 65536, and declares the limits fine. That
check proved nothing. ulimit reports the limits of your login shell,
which inherited them from sshd and PAM. The failing service was started by systemd, or a
container runtime, or an init script from 2019, and inherited a completely different set.
Limits are per-process, set at start time, and the only authoritative record is the one the
kernel keeps for the process itself.
    $ ulimit -n
    65536                                  # your shell. irrelevant.
    $ grep "open files" /proc/41327/limits
    Max open files            1024                 4096                 files
    $ ls /proc/41327/fd | wc -l
    1019                                   # five away from the soft limit
Two reads and the diagnosis is complete: the process runs with a soft limit of 1024, it is
holding 1019 descriptors, and the crash is minutes away. Whether those 1019 descriptors are
legitimate or a leak is the next question — ls -l /proc/41327/fd shows what each
one points at, and the lsof page walks that investigation. The lesson generalises: any time
the question is "what constraints does this process run under," read its own
limits, environ, and cgroup files instead of guessing
from a shell that shares nothing with it but a hostname.
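The two reads from this scenario combine naturally into one check. The sketch below is illustrative only: fd_headroom is a hypothetical helper name, and it assumes you have permission to list the target's fd directory (your own processes, or root).

```shell
# fd_headroom PROC_DIR  (hypothetical helper)
# Print "open soft remaining" for the process whose /proc directory is PROC_DIR.
# The ( ) body runs in a subshell, which keeps the variables local in POSIX sh.
fd_headroom() (
    open=$(($(ls "$1/fd" | wc -l)))
    soft=$(awk '/^Max open files/ { print $4 }' "$1/limits")
    case $soft in
        ''|*[!0-9]*) echo "cannot read soft limit from $1/limits" >&2; exit 1 ;;
    esac
    echo "$open $soft $((soft - open))"
)

# Usage: the current shell, if /proc is mounted.
if [ -r "/proc/$$/limits" ]; then
    fd_headroom "/proc/$$"
fi
```

For the failing service in the transcript this would print 1019 1024 5: the descriptors held, the soft limit, and the headroom left.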
The binary was deleted, the process lives on
A deploy replaced /usr/local/bin/worker, but the old process is still running —
and now it is misbehaving in a way the new binary supposedly fixed, and someone wants to
disassemble exactly what is executing. Or worse: the binary was deleted by an attacker
covering tracks, and the only copy left in the world is the one the kernel is executing.
Either way, the file has no name on disk, and either way it does not matter, because deleting
a file only removes its name; the inode survives while anything holds it — and a running
program holds its own text. /proc/PID/exe is a live handle to that inode.
    $ ls -l /proc/8841/exe
    lrwxrwxrwx 1 root root 0 Jun  8 10:42 /proc/8841/exe -> /usr/local/bin/worker (deleted)
    $ sudo cp /proc/8841/exe /tmp/worker.rescued
    $ sha256sum /tmp/worker.rescued
    9f86d081884c7d65...  /tmp/worker.rescued
The (deleted) marker confirms the name is gone, but cp through the
symlink reads the inode's contents byte for byte. You now have the exact binary that is
running, suitable for hashing against your artifact registry, diffing against the new
release, or handing to forensics. The same trick recovers a deleted shared library through
/proc/PID/map_files/, and it is the reason "delete the malware" is not the same
as "stop the malware." The inode-and-link-count machinery this rides on is covered in
file systems.
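To find every process in this state rather than one you already suspect, you can scan the exe links for the marker. This is a rough sketch: deleted_exes is a hypothetical helper name, and the match is a heuristic, since a file whose real name ends in " (deleted)" would also match.

```shell
# deleted_exes [PROC_ROOT]  (hypothetical helper)
# List "PID target" for every process whose exe link carries the (deleted) marker.
deleted_exes() (
    root=${1:-/proc}
    for dir in "$root"/[0-9]*; do
        # readlink fails for exited processes, kernel threads, and
        # processes we lack permission to inspect; skip all of those.
        target=$(readlink "$dir/exe" 2>/dev/null) || continue
        case $target in
            *' (deleted)') echo "${dir##*/} $target" ;;
        esac
    done
)

# Usage: run as root to see other users' processes as well.
deleted_exes
```

After a package upgrade this list is often long and entirely innocent; it is the unexpected entry that deserves the cp and the hash.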
The process is at 400% CPU — which thread?
top shows one process pinned at 400% CPU. A process is not a unit of execution,
though; its threads are, and this one has 47 of them. The kernel publishes each thread as a
subdirectory of /proc/PID/task/, named by thread ID, with the same files inside
— its own stat, its own status, its own comm. Most
runtimes name their threads, so the busy ones often identify themselves.
    $ ls /proc/41327/task | head -4
    41327
    41342
    41358
    41389
    $ for t in /proc/41327/task/*; do
        echo "$(cut -d' ' -f1 "$t/schedstat") $(cat "$t/comm")"
      done | sort -nr | head -4
    7841220047113 GC Thread#0
    7790881520446 GC Thread#1
    412904118821 http-worker-3
    1204481950 main
The first field of a thread's schedstat file is its cumulative on-CPU time in
nanoseconds (fields 14 and 15 of stat carry the same story in user and kernel
ticks); sample it twice a few seconds apart and the deltas tell you which threads are
burning. In this excerpt the garbage collector threads dwarf everything else — a memory problem wearing a
CPU costume. In practice you would let top -H -p 41327 do the sampling and
sorting for you, and the top & htop page
covers that view; the point here is that top -H is not doing anything you could
not do with a shell loop over task/. When the tool's numbers look wrong, the
files are how you check them.
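The sample-twice step can be sketched as two small functions. Both names (snapshot_threads, thread_deltas) are hypothetical, and the sketch assumes the kernel exposes schedstat, which depends on how it was built.

```shell
# snapshot_threads TASK_DIR  (hypothetical helper)
# Print "tid on_cpu_ns comm" for each thread under a task/ directory.
snapshot_threads() (
    for t in "$1"/*; do
        # Threads can exit mid-walk; skip any whose files have vanished.
        { read -r ns rest < "$t/schedstat"; } 2>/dev/null || continue
        { read -r name < "$t/comm"; } 2>/dev/null || continue
        echo "${t##*/} $ns $name"
    done
)

# thread_deltas BEFORE AFTER  (hypothetical helper)
# Join two snapshots on tid; print "delta_ns tid comm", busiest first.
thread_deltas() {
    awk '
        FILENAME == ARGV[1] { before[$1] = $2; next }
        ($1 in before) {
            tid = $1; delta = $2 - before[tid]
            sub(/^[^ ]+ [^ ]+ /, "")        # strip tid and counter, leaving comm
            printf "%.0f %s %s\n", delta, tid, $0
        }
    ' "$1" "$2" | sort -nr
}

# Usage sketch (left as comments so nothing sleeps when this is sourced):
#   snapshot_threads /proc/41327/task > /tmp/before; sleep 2
#   snapshot_threads /proc/41327/task > /tmp/after
#   thread_deltas /tmp/before /tmp/after | head -4
```

Threads that start or exit between the two samples drop out of the join, which is usually what you want when hunting the steady burner.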
What /proc actually is
It helps to see /proc for what it is underneath: an interface, dressed up as a
filesystem because Unix already had superb plumbing for files. The kernel keeps its real
bookkeeping in internal structures — the task list, the per-process descriptor tables and
memory maps described in
processes, the page
tables described in
virtual memory.
Procfs registers a tree of names, and behind each name a handler function. open()
on /proc/41327/status does not find data; it finds the handler.
read() invokes it, the handler walks the live task structure for PID 41327,
formats what it finds as text, and that text is your file contents. The file is the
conversation, not a thing that exists between conversations.
This explains every oddity at once. Sizes show as zero because there is nothing to measure
until a read happens. Contents change between reads because each read recomputes. Listing
/proc is enumerating the task list, which is why directories appear and vanish
as processes start and exit. Permissions are real and enforced — most of another user's
/proc/PID internals are unreadable without privilege, which is exactly why an
unprivileged lsof sees so little. And reads have genuine cost, because "render
this answer" can mean real kernel work: a read of smaps_rollup walks the
process's page tables to count page references. The filesystem metaphor is so good that it is
easy to forget you are calling into the kernel every time, but you are.
The same trick appears elsewhere once you know to look. /sys exposes devices and
kernel parameters through the identical render-on-read mechanism, one value per file.
/proc/sys/ goes further and accepts writes — echo a number into
/proc/sys/vm/swappiness and you have changed a kernel tunable with no tool but
the shell. The design lesson is the one Unix keeps teaching: if you expose state as files,
every existing tool — cat, grep, watch, a shell loop —
becomes a client for free.
Pitfalls
Summing PSS across a busy box is expensive. Every read of
smaps_rollup makes the kernel walk that process's page tables, and a loop over
two thousand processes is two thousand walks. It will finish, but on a loaded machine it can
take seconds of CPU and visibly perturb the thing you are measuring. Do it when you need the
true total; do not put it in a one-second metrics loop. (And never parse full
smaps when smaps_rollup exists — same data, hundreds of times more
text.)
environ and cmdline are NUL-delimited — and environ is a snapshot. The
missing newlines are a five-second annoyance fixed with tr '\0' '\n'. The deeper
trap is temporal: environ shows the environment as it was at exec
time. Changes made afterwards, whether by setenv() or by a runtime that mutates its own
environment, are generally not reflected. It answers "what was this started with," not "what does
it see now."
PIDs race. Between you reading /proc/41327/status and
/proc/41327/fd, the process can exit and the kernel can hand 41327 to a brand-new
process. The kernel's default PID space is small (32768, though many distributions raise it; check /proc/sys/kernel/pid_max) and
wraps fast on busy machines. Scripts that walk /proc must tolerate directories
vanishing mid-walk — an ENOENT there is weather, not an error — and anything
that needs a stable handle on a process should hold its /proc/PID directory
open or use pidfds rather than trusting the number twice.
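For shell scripts, where pidfds are out of reach, a common mitigation is to remember a process's start time alongside its PID and re-check it before trusting the number again. The sketch below assumes the stat layout documented in proc(5), where field 22 is the start time; proc_starttime and same_process are hypothetical helper names. It narrows the race rather than closing it.

```shell
# proc_starttime PROC_DIR  (hypothetical helper)
# Print field 22 of stat: the process start time, in clock ticks since boot.
proc_starttime() {
    # comm (field 2) may contain spaces and parentheses, so drop everything
    # through the last ") " first; field 22 is then the 20th remaining field.
    sed 's/^.*) //' "$1/stat" 2>/dev/null | cut -d' ' -f20
}

# same_process PROC_DIR START  (hypothetical helper)
# Succeed only if PROC_DIR still belongs to the process first seen with START.
same_process() {
    [ -n "$2" ] && [ "$(proc_starttime "$1")" = "$2" ]
}

# Usage sketch:
#   start=$(proc_starttime /proc/41327)
#   ... later ...
#   same_process /proc/41327 "$start" || echo "PID 41327 is gone or was reused"
```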
In a container, /proc tells you about the host. Procfs reports kernel state,
and a container shares the host's kernel. So /proc/meminfo inside a container
with a 512 MB cgroup limit cheerfully reports the host's 64 GB, and
/proc/loadavg reports the load of every tenant on the machine. Generations of
runtimes sized their heaps and thread pools off these files and then died confused when the
cgroup limit arrived first. The real limits live under /sys/fs/cgroup/, and
some platforms mount lxcfs over the offending files to fake container-local
views — which means inside a container you cannot even assume the lie is consistent. PID
namespaces add a second twist: /proc inside the container lists only the
container's processes, renumbered from 1, so the PID you see inside is not the PID the host
sees.
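A script that has to size something from memory can check both sources and take the smaller. The sketch below assumes a cgroup v2 unified hierarchy, where the container's limit appears in /sys/fs/cgroup/memory.max as a byte count or the word max; cgroup v1 keeps the equivalent in a different file, and a limit set on an ancestor cgroup may not be visible from inside. effective_mem_kb is a hypothetical helper name.

```shell
# effective_mem_kb [MEMINFO] [CGROUP_MAX]  (hypothetical helper)
# Print the smaller of MemTotal and the cgroup v2 memory limit, in kB.
# Assumes cgroup v2: memory.max holds a byte count or the word "max".
effective_mem_kb() (
    meminfo=${1:-/proc/meminfo}
    cg_max=${2:-/sys/fs/cgroup/memory.max}
    host_kb=$(awk '$1 == "MemTotal:" { print $2 }' "$meminfo")
    limit=max
    if [ -r "$cg_max" ]; then limit=$(cat "$cg_max"); fi
    case $limit in
        ''|*[!0-9]*)                      # "max" or unreadable: no cgroup cap
            echo "$host_kb" ;;
        *)
            limit_kb=$((limit / 1024))
            if [ "$limit_kb" -lt "$host_kb" ]; then
                echo "$limit_kb"
            else
                echo "$host_kb"
            fi ;;
    esac
)

# Usage: what this shell can realistically count on.
if [ -r /proc/meminfo ]; then
    effective_mem_kb
fi
```

For the 512 MB container on a 64 GB host described above, this prints 524288 instead of the host's 67108864.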
Not everything under /proc is read-only. Almost everything on this page is
read-only and safe, but /proc has writable corners, such as the /proc/sys/
tunables and /proc/sysrq-trigger, where a stray redirect changes kernel behaviour
immediately, no confirmation asked. The habit to build: cat freely, but treat
any > aimed inside /proc with the respect you would give a
config change in production, because that is what it is.
A drill you can run right now
Everything below is read-only and safe on any Linux machine, shared boxes included. The
subject is your own shell, which the variable $$ always names, so no permissions
are needed and nothing can be disturbed. Ten minutes, and the five files stop being a list
and become places you have been.
Step 1 — meet your shell. Read its summary and pick out the lines you now know:
    $ cat /proc/$$/status | head -20
    Name:    bash
    State:   S (sleeping)
    PPid:    2204          # trace it: that's sshd or your terminal
    VmRSS:   5288 kB
    Threads: 1
    $ tr '\0' '\n' < /proc/$$/environ | head -5
    LANG=en_US.UTF-8
    HOME=/home/nilesh
    SHELL=/bin/bash
Follow the PPid upward — cat /proc/2204/comm — and keep going until
you hit PID 1. You have just walked the process tree by hand, which is all
pstree does.
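If you would rather not follow the chain by hand, the walk is a short loop. A minimal sketch, with ancestry as a hypothetical helper name; it reads only comm and the PPid line of status, and stops when PPid reaches 0.

```shell
# ancestry PID [PROC_ROOT]  (hypothetical helper)
# Print "PID name" for a process and each of its ancestors, ending at PID 1.
ancestry() (
    pid=$1
    root=${2:-/proc}
    while [ "${pid:-0}" -gt 0 ]; do
        # A process can exit mid-walk; stop quietly if it has.
        { read -r name < "$root/$pid/comm"; } 2>/dev/null || break
        echo "$pid $name"
        pid=$(awk '$1 == "PPid:" { print $2; exit }' "$root/$pid/status" 2>/dev/null)
    done
)

# Usage: your own shell's lineage.
ancestry "$$"
```

Inside a container the walk ends at the container's own PID 1, because that is where the PID namespace begins.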
Step 2 — descriptors, limits, and place. Look at what your shell holds open and the rules it runs under:
    $ ls -l /proc/$$/fd
    lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 0 -> /dev/pts/0
    lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 1 -> /dev/pts/0
    lrwx------ 1 nilesh nilesh 64 Jun  8 11:02 2 -> /dev/pts/0
    $ grep "open files" /proc/$$/limits
    Max open files            1024                 1048576              files
    $ ls -l /proc/$$/cwd /proc/$$/exe
    lrwxrwxrwx ... /proc/12871/cwd -> /home/nilesh
    lrwxrwxrwx ... /proc/12871/exe -> /usr/bin/bash
Descriptors 0, 1, and 2 all point at your terminal device — standard input, output, and
error, visible as plain symlinks. Now run exec 3< /etc/hostname, list
fd again, and watch descriptor 3 appear; exec 3<&- closes it
and it vanishes. You have watched the kernel's descriptor table change in real time, which is
the entire mechanism behind lsof.
Step 3 — catch a tool red-handed. Prove that free is a
formatter, not a source:
    $ head -3 /proc/meminfo
    MemTotal:       16384256 kB
    MemFree:         1204480 kB
    MemAvailable:    9842116 kB
    $ free -k | head -2
                   total        used        free      shared  buff/cache   available
    Mem:        16384256     6021048     1204480      312044     9158728     9842116
Same numbers, to the kilobyte: total is MemTotal,
free is MemFree, available is
MemAvailable, and the rest is arithmetic over a few more lines of the same file.
If you have strace handy, strace -e openat free shows the
openat("/proc/meminfo") call outright — the
strace page makes a habit of this kind of
unmasking. Finish by reading cat /proc/loadavg and comparing it with
uptime's output, and the lesson is complete: the tools are conveniences. The
files are the truth.
/proc/PID/status for what a process
is, /proc/PID/fd for what it holds, /proc/PID/limits for the rules
it actually runs under — and tr '\0' '\n' whenever the output arrives as one
long line.
Further reading
- proc(5) — the manual page — the full catalogue of every entry. Long, but the per-process section repays a slow read with coffee.
- The /proc filesystem — kernel documentation — the kernel's own description, including the smaps and meminfo field definitions in their original habitat.
- pidfd_open(2) — the modern fix for the PID-reuse race: a file descriptor that names a process stably.
- Semicolony — lsof — the previous stop on this track, and the clearest worked example of a tool that is /proc with formatting.