28 pages · question-shaped

Linux, at the terminal.

The layer between shell syntax and kernel theory: the investigative one. You know how to cd around and you've heard of the scheduler — what's missing is the middle, where a box is misbehaving and you have to find out why with the tools already installed on it. Twenty-eight pages, organised by the question you're asking rather than alphabetically, each built around the five flags that matter, annotated real output, three production scenarios, and how the thing works underneath.


Why organised by question

Nobody debugging a server thinks alphabetically. You think in questions: is this machine overloaded? Who's holding port 8080? Why is this process stuck? Is the problem on the wire or in the code? Tool lists answer "what does lsof do"; the harder skill is the reverse lookup, from symptom to the command that will tell you something. So the codex is arranged by question, in roughly the order you'd ask them during an incident: the machine first, then resources, then the process itself, then the network, then the text tools that glue an investigation together.

It also stays deliberately in the middle of the stack. Below this sits kernel theory — schedulers, the VFS, page reclaim — which is interesting but rarely the thing between you and a fix. Above it sits shell syntax, which you already have. The middle layer is the one most engineers pick up by shoulder-surfing someone senior during an outage. These pages are that, written down.

The map

A · First look

What is this machine doing?

The thirty-second triage. Before you blame the code, the database, or the network, these two pages tell you whether the box itself is CPU-bound, memory-starved, or waiting on disk — and how to tell those apart.
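As a taste of what that triage rests on, here is a minimal Python sketch that reads the same /proc files the usual first-look tools draw from. The function names and the two ratios it returns are illustrative assumptions, not something the pages themselves define, and it assumes a kernel recent enough to expose MemAvailable in /proc/meminfo.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text):
    """Return /proc/meminfo as a dict of field name -> integer (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def triage(loadavg_text, meminfo_text, cpu_count):
    """Hypothetical helper: two rough first-look ratios, not a diagnosis."""
    load1, _, _ = parse_loadavg(loadavg_text)
    mem = parse_meminfo(meminfo_text)
    return {
        # Runnable-or-blocked tasks per CPU over the last minute.
        "load_per_cpu": load1 / cpu_count,
        # Share of RAM the kernel estimates is available without swapping.
        "mem_available_fraction": mem["MemAvailable"] / mem["MemTotal"],
    }


# Only meaningful on Linux, where /proc exists.
if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        loadavg_text = f.read()
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    print(triage(loadavg_text, meminfo_text, os.cpu_count() or 1))
```

One caveat worth keeping in mind: the Linux load average counts tasks in uninterruptible sleep (usually waiting on disk) as well as runnable ones, so a high load per CPU alone cannot separate CPU-bound from waiting on disk. That distinction is exactly what the per-state CPU breakdown is for.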

B · Resources

Who is holding what?

Ports, files, sockets: on Linux a process holds each of them through a file descriptor, and these two commands tell you which process owns which one. The answer to half of all "address already in use" tickets.
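To make the "which process owns which descriptor" lookup concrete, the sketch below does it by hand in Python: find the socket inode listening on a port in /proc/net/tcp, then scan each process's fd directory for a link to that inode. It is a rough illustration of the idea under stated assumptions, not a description of how any particular tool is implemented; the function names are hypothetical, and it only covers the IPv4 table (IPv6 listeners live in /proc/net/tcp6, which uses the same column layout).

```python
import os
import re

TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp


def listening_inodes(proc_net_tcp_text, port):
    """Return the socket inodes (as strings) listening on `port`."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        # local_address is HEXADDR:HEXPORT; the port is the part after the last colon.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if fields[3] == TCP_LISTEN and local_port == port:
            inodes.add(fields[9])
    return inodes


def pids_holding(inodes, proc_root="/proc"):
    """Scan /proc/<pid>/fd for symlinks of the form socket:[inode]."""
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not read its descriptors
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            match = re.fullmatch(r"socket:\[(\d+)\]", target)
            if match and match.group(1) in inodes:
                holders.add(int(entry))
    return holders


# Only meaningful on Linux, where /proc exists.
if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    with open("/proc/net/tcp") as f:
        inodes = listening_inodes(f.read(), 8080)
    print(sorted(pids_holding(inodes)))
```

Run as an ordinary user, this only sees descriptors of your own processes; other users' fd directories are unreadable without root, which is also why the real tools often show an empty owner column until you add sudo.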

C · Inside a process

What is this process doing?

When the process is up but wrong — slow, stuck, leaking — you can watch it from the outside without touching the code. Syscalls, the /proc window, and how to stop a process without losing data.
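The /proc window is plain text, which is why it is so easy to watch from the outside. As a minimal sketch (the helper names are hypothetical, and the choice of fields is just one reasonable first cut), this Python snippet pulls the handful of lines from /proc/<pid>/status that usually answer "is it sleeping, stuck on disk, or growing?".

```python
import os


def parse_status(text):
    """Turn /proc/<pid>/status text into a dict of field name -> raw string value."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value.strip()
    return fields


def summarize(text):
    """Pick out the fields most useful for a first look at a process."""
    status = parse_status(text)
    return {
        "name": status.get("Name"),
        "state": status.get("State"),      # e.g. "S (sleeping)" or "D (disk sleep)"
        "rss": status.get("VmRSS"),        # resident memory; absent for kernel threads
        "threads": status.get("Threads"),
    }


# Only meaningful on Linux; /proc/self always refers to the current process.
if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as f:
        print(summarize(f.read()))
```

Swap `self` for a PID to look at another process. Sampling the same file a few seconds apart is a cheap way to see whether the resident size is climbing or the state is pinned at D.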

D · The wire

Is it the network?

"It's always DNS" is a joke because it's usually true. These two pages cover watching packets on the wire and interrogating resolvers, so you can prove where a request died instead of guessing.

E · The toolbelt

Text, logs, glue

Most evidence arrives as text. Logs from the journal and the kernel, files scattered across a tree, JSON from every API you touch — these three pages are how you cut it down to the line that matters.

How to read these pages

Every page has the same spine, so once you've read one you know where to look in all of them:

  • The five flags that matter. Most of these commands have a hundred-plus options; you'll use five of them for ninety-odd percent of real work. Each page names those five and ignores the rest.
  • Annotated real output. Actual terminal output with the lines you should care about marked, because the hard part of top isn't running it — it's knowing which of the forty numbers on screen means trouble.
  • Three production scenarios. Short stories in the shape real incidents take: a symptom, the command, what the output showed, what it meant.
  • How it works underneath. One level down, no further. Enough of /proc, syscalls, or the socket API to make the command's behaviour predictable instead of memorised.

Start with the investigations. If you'd rather learn backwards from a problem, skip straight to phase F, the investigations: What's eating my CPU?, then memory, a held port, and a full disk. Each one walks an incident end to end and links back to the tool pages as it uses them, so you can read in whichever direction sticks.

Before you run anything

Almost nothing here is dangerous, but a few habits keep it that way:

  • Read-only first. Almost everything in this codex observes rather than changes. Get comfortable with the observing commands before you reach for kill, and you'll rarely make an incident worse.
  • Your own machine, or a throwaway VM. Every page's examples run fine on a laptop or a free-tier VM. Practising strace on a scratch box costs nothing; practising it for the first time during an outage costs plenty.
  • Man pages are the reference. These pages cover the five flags that earn their keep, not the hundred-plus others. When you need the long tail, man tcpdump is better than any tutorial, and these pages teach you enough to read it.
  • Prod etiquette: observe before you act. On a shared production box, run the read-only commands first, say what you're about to do out loud (or in the incident channel), and prefer the gentlest signal that works. Heroics with kill -9 make better stories than outcomes.
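
"Prefer the gentlest signal that works" can be written down as a small escalation routine. The Python sketch below is one way to express it for a child process you launched yourself: send SIGTERM, give the process a grace period to clean up, and reach for SIGKILL only if it is still there. The function name and the five-second default are assumptions chosen for illustration.

```python
import signal
import subprocess
import sys


def stop_gently(proc, grace_seconds=5.0):
    """Ask a child process to exit with SIGTERM; escalate to SIGKILL only on timeout.

    Returns the name of the signal that was needed to end it.
    """
    proc.send_signal(signal.SIGTERM)  # catchable: the process can flush and close
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught, so no cleanup code runs
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    # A stand-in for a misbehaving service: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_gently(child, grace_seconds=2.0))
```

For a process you did not start, the equivalent first step from Python is os.kill(pid, signal.SIGTERM), which is what a plain kill with no signal argument sends from the shell.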