Linux, at the terminal.
The layer between shell syntax and kernel theory: the investigative one. You know how to cd around and you've heard of the scheduler — what's missing is the middle, where a box is misbehaving and you have to find out why with the tools already installed on it. Twenty-eight pages, organised by the question you're asking rather than alphabetically, each built around the five flags that matter, annotated real output, three production scenarios, and how the thing works underneath.
Why organised by question
Nobody debugging a server thinks alphabetically. You think in questions: is this machine overloaded? Who's holding port 8080? Why is this process stuck? Is the problem on the wire or in the code? Tool lists answer "what does lsof do" — the harder skill is the reverse lookup, from symptom to the command that will tell you something. So the codex is arranged as the questions, in roughly the order you'd ask them during an incident: the machine first, then resources, then the process itself, then the network, then the text tools that glue an investigation together.
It also stays deliberately in the middle of the stack. Below this sits kernel theory — schedulers, the VFS, page reclaim — which is interesting but rarely the thing between you and a fix. Above it sits shell syntax, which you already have. The middle layer is the one most engineers pick up by shoulder-surfing someone senior during an outage. These pages are that, written down.
The map
What is this machine doing?
The thirty-second triage. Before you blame the code, the database, or the network, these two pages tell you whether the box itself is CPU-bound, memory-starved, or waiting on disk — and how to tell those apart.
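As a rough sketch of that first look, the snippet below reads three files under /proc directly. It is an illustration only, not taken from the two pages in this section, and the variable names are made up for the example.

```shell
# Rough triage sketch: reads /proc directly, so it needs no extra packages.
# On Linux the load average counts runnable tasks AND tasks in uninterruptible
# (usually disk) sleep, so a high load alone does not prove the CPU is busy.
cores=$(nproc)
read -r load1 load5 load15 _ < /proc/loadavg
echo "cores=$cores load1=$load1 load5=$load5 load15=$load15"

# MemAvailable: the kernel's estimate of memory usable without swapping.
mem_available_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "mem_available_mib=$(( ${mem_available_kib:-0} / 1024 ))"

# First line of /proc/stat: cumulative CPU ticks since boot.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
awk '/^cpu / {
  total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
  printf "iowait_share_since_boot=%.1f%%\n", 100 * $6 / total
}' /proc/stat
```

A load well above the core count with a low iowait share points at the CPU; a high iowait share points at disk; a small MemAvailable figure points at memory. The iowait number here is an average since boot, so treat it as a hint rather than a live reading.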
Who is holding what?
Ports, files, sockets: on Linux a process reaches all of them through file descriptors, and these two commands tell you which process owns which one. That is the answer to half of all "address already in use" tickets.
What is this process doing?
When the process is up but wrong — slow, stuck, leaking — you can watch it from the outside without touching the code. Syscalls, the /proc window, and how to stop a process without losing data.
- 05
strace
Attaching to a live process, -e trace= to cut the noise, reading an ENOENT storm.
- 06
/proc
The kernel's window into every process: fd/, status, limits, smaps, and what to read first.
- 07
kill & signals
TERM vs KILL vs HUP, what a graceful shutdown actually is, why -9 is the last resort.
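To give a feel for the /proc window, here is a minimal sketch that inspects the current shell. The choice of fields is an assumption about what is worth reading first, and `pid` is a placeholder for whatever process you are investigating.

```shell
pid=$$   # inspect this shell; substitute the PID you are investigating

# status: one fact per line. State, resident memory and thread count first.
grep -E '^(State|VmRSS|Threads):' "/proc/$pid/status"

# limits: the soft and hard caps the process is actually running under.
grep '^Max open files' "/proc/$pid/limits"

# fd/: one symlink per open descriptor; this count is what the limit above caps.
open_fds=$(ls "/proc/$pid/fd" | wc -l)
echo "open_fds=$open_fds"
```

Everything here is a plain read, so it is safe to run against a live process you own.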
Is it the network?
"It's always DNS" is a joke because it's usually true. These two pages cover watching packets on the wire and interrogating resolvers, so you can prove where a request died instead of guessing.
Text, logs, glue
Most evidence arrives as text. Logs from the journal and the kernel, files scattered across a tree, JSON from every API you touch — these three pages are how you cut it down to the line that matters.
- 10
journalctl & dmesg
-u and --since to narrow the window, kernel messages, finding the OOM-killer line.
- 11
find & xargs
-exec vs piping to xargs, null-delimiting so filenames with spaces don't bite you.
- 12
jq
Filters, map and select, turning a wall of JSON into the three fields you wanted.
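The null-delimiting point is easiest to see on a scratch directory. This sketch uses made-up filenames and assumes the temporary directory path itself contains no spaces.

```shell
# Scratch directory with one awkward filename (hypothetical names).
dir=$(mktemp -d)
touch "$dir/plain.log" "$dir/with space.log"

# Whitespace splitting: xargs breaks "with space.log" into two arguments.
naive=$(find "$dir" -name '*.log' | xargs -n 1 echo | wc -l)

# Null-delimited: find ends each name with \0 and xargs -0 splits only on \0.
safe=$(find "$dir" -name '*.log' -print0 | xargs -0 -n 1 echo | wc -l)

echo "naive=$naive safe=$safe"   # there are two files on disk
rm -r "$dir"
```

The naive pipeline hands xargs one argument more than there are files. With a harmless `echo` that is only a wrong count; with `rm` it is the wrong file.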
The longer tail
Ten more pages: disk triage, the modern network tools, profiling, priorities and cgroups, limits, the text-processing workhorses, and the survival kit that keeps a long session alive.
- 17
df, du & ncdu
Filesystem vs file accounting, inode exhaustion, and ncdu for fast triage.
- 18
ip
The ifconfig replacement: addr, route, link, neigh — reading a box's network identity.
- 19
nc & mtr
Netcat as a universal socket tool; mtr as the traceroute that tells the truth about loss.
- 20
perf
Sampling the CPU: perf top, record and report, flame graphs, and reading perf stat.
- 21
nice, ionice & cgroups
Priorities, IO classes, and the cgroup v2 view — what "throttled" means in a container.
- 22
ulimit & limits
"Too many open files": soft vs hard limits, /proc/PID/limits, systemd's LimitNOFILE.
- 23
awk & sed
The ten one-liners that replace a script: columns, sums, filters, in-place edits.
- 24
The pipeline
sort, uniq, cut, tr, wc — the histogram-from-logs pattern and friends.
- 25
watch, time & tmux
Watching things change, real vs user vs sys, and keeping sessions alive across disconnects.
- 26
systemctl
Units, reading status output properly, restart vs reload, enable vs start, drop-ins and timers.
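The histogram-from-logs pattern mentioned under "The pipeline" can be sketched in one line. The log format below is invented for the example; in a real log the column number and delimiter will differ.

```shell
# Hypothetical access-log lines: method, path, status.
log=$(mktemp)
printf '%s\n' \
  'GET /a 200' 'GET /b 500' 'GET /a 200' \
  'POST /c 500' 'GET /a 404' 'GET /d 200' > "$log"

# cut picks the status column, sort groups equal values together,
# uniq -c counts each group, and sort -rn puts the biggest count first.
histogram=$(cut -d ' ' -f 3 "$log" | sort | uniq -c | sort -rn)
echo "$histogram"

rm "$log"
```

The first `sort` is not optional: `uniq` only collapses adjacent duplicates, so unsorted input gives fragmented counts.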
The investigations
Six full incident walkthroughs, each starting from a symptom and ending at a root cause. Every step uses a command from the pages above — this is the material the rest of the codex builds toward.
- 13
What's eating my CPU?
From a high load average to the exact thread and the line of code behind it.
- 14
What's eating my memory?
Cache vs leak vs ballooning heap, and finding which one you have before the OOM killer does.
- 15
What's holding this port?
"Address already in use" — finding the owner, deciding whether to kill it, and TIME_WAIT.
- 16
Why is the disk full?
When df and du disagree: deleted-but-open files, inode exhaustion, and runaway logs.
- 27
Is it the network?
Splitting blame between the wire and the code: loss, latency, DNS, and the curl -w timings.
- 28
The box is slow
The general sixty-second triage, USE-method shaped, that routes you to the right deep dive.
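The deleted-but-open case behind "df and du disagree" can be reproduced safely on a scratch file. In this sketch a background `sleep` stands in for a daemon that keeps its log open; the descriptor number 8 and the half-second pause are arbitrary choices for the demonstration.

```shell
# Hypothetical stand-in for a daemon holding its log open: sleep keeps fd 8.
f=$(mktemp)
sleep 30 8>"$f" &
holder=$!
sleep 0.5            # let the background process open the file first

rm "$f"              # the name is gone, so du no longer sees the file

# The descriptor still pins the inode, so df keeps counting its blocks.
# The kernel marks the link target with a " (deleted)" suffix.
target=$(readlink "/proc/$holder/fd/8")
echo "$target"

kill "$holder"                       # once the last holder exits, space is freed
wait "$holder" 2>/dev/null || true   # reap it quietly
```

On a real box the same `(deleted)` marker is what you look for when hunting the process that is sitting on reclaimed-looking space.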
How to read these pages
Every page has the same spine, so once you've read one you know where to look in all of them:
- The five flags that matter. Most of these commands have a hundred-plus options; you'll use five of them for ninety-odd percent of real work. Each page names those five and ignores the rest.
- Annotated real output. Actual terminal output with the lines you should care about marked, because the hard part of top isn't running it — it's knowing which of the forty numbers on screen means trouble.
- Three production scenarios. Short stories in the shape real incidents take: a symptom, the command, what the output showed, what it meant.
- How it works underneath. One level down, no further. Enough of /proc, syscalls, or the socket API to make the command's behaviour predictable instead of memorised.
Before you run anything
Almost nothing here is dangerous, but a few habits keep it that way:
- Read-only first. Almost everything in this codex observes rather than changes. Get comfortable with the observing commands before you reach for kill, and you'll rarely make an incident worse.
- Your own machine, or a throwaway VM. Every page's examples run fine on a laptop or a free-tier VM. Practising strace on a scratch box costs nothing; practising it for the first time during an outage costs plenty.
- Man pages are the reference. These pages cover the five flags that earn their keep, not all two hundred. When you need the long tail, man tcpdump is better than any tutorial — these pages teach you enough to read it.
- Prod etiquette: observe before you act. On a shared production box, run the read-only commands first, say what you're about to do out loud (or in the incident channel), and prefer the gentlest signal that works. Heroics with kill -9 make better stories than outcomes.
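Why the gentlest signal matters can be shown with a toy worker. The sketch below is a hypothetical setup: `run_worker` and the state file are invented for the example, and the worker is just a bash loop with a TERM handler.

```shell
# Hypothetical worker: its TERM handler records a clean shutdown, then exits 0.
run_worker() {   # $1 = signal to send, $2 = state file
  bash -c 'trap "echo flushed > \"$1\"; exit 0" TERM
           while :; do sleep 0.1; done' worker "$2" &
  worker_pid=$!
  sleep 0.5                      # give the worker time to install its trap
  kill -s "$1" "$worker_pid"
  status=0
  wait "$worker_pid" || status=$?
  echo "signal=$1 exit=$status state=$(cat "$2")"
}

run_worker TERM "$(mktemp)"   # handler runs, state is written
run_worker KILL "$(mktemp)"   # no handler can run: KILL cannot be caught
```

With TERM the worker gets to write its state and exit with status 0. With KILL the state file stays empty and the shell reports 137 (128 plus signal 9), which is the cost of skipping the polite request.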