The pipeline

You have fifty thousand lines of access log and one question: which IP is hammering us? Somewhere in that wall of text is a ranked list five lines long, and the shortest path to it is not a script, a notebook, or a dashboard. It is a pipeline: four small tools, each doing one transformation, connected by streams. This page covers the cast — sort, uniq, cut, tr, wc, head, tail, tee — with the two flags each one actually needs, teaches the one pattern that answers most log questions, explains what a pipe really is inside the kernel, and ends with a drill on a fake log you build yourself.

The question it answers

The recurring shape of operational work is: a lot of text goes in, a small answer comes out. How many errors since the deploy? Which endpoint is slowest? Which IP appears most often? What changed between these two config dumps? Each of these starts with thousands of lines and ends with a number or a short ranked list. The pipe is the machinery that gets you from one to the other, and it does it by composition rather than by features.

The idea is old and it has aged well. Instead of one program that reads logs, filters them, counts things, and sorts the result, you get a handful of programs that each do exactly one of those transformations, and a connector — the | — that feeds the output of one into the input of the next as a stream of lines. grep filters lines. awk and cut pick columns out of lines. sort reorders lines. uniq collapses repeated lines. wc counts them. head keeps the first few. None of these knows anything about the others; each reads from standard input and writes to standard output, and the shell wires them together. Because the interface is just "lines of text," any tool composes with any other, including ones written forty years apart.

Figure: A pipeline is a chain of single-purpose transformations. tee is the odd one out: it passes the stream through unchanged while writing a copy to a file, which is how you inspect the middle of a pipe.

Two things make this more than a party trick. First, the composition is interactive: you build the pipeline one stage at a time, looking at the output after each addition, so by the time it is five stages long you have already verified four of them. Second, the streams are real concurrency — every process in the pipeline runs at once, connected by kernel buffers, which we will open up later in this page. But first, the cast.

The cast, two flags each

Every tool here has a man page ten times longer than it deserves. What follows is each one's job in a sentence and the small set of flags that carry nearly all real use.

sort

sort reorders lines. By default it compares them as strings, left to right, which is exactly wrong for numbers: 10 sorts before 2 because the character 1 comes before the character 2. The flag you will type most is -n, numeric compare, and its sibling -h, human-numeric, which understands the suffixed sizes that du -h and friends emit, so 2.1G outranks 980M. Add -r to either for descending order, which is what "top ten" means.
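The difference is easiest to see on three numbers. The sketch below uses made-up inline data and assumes a sort that implements -h (GNU sort does):

```shell
# Default comparison is by character, so "10" lands between "1" and "2".
printf '10\n2\n1\n' | sort       # 1, 10, 2

# -n compares the values themselves.
printf '10\n2\n1\n' | sort -n    # 1, 2, 10

# -h understands size suffixes; -r puts the largest first.
printf '980M\n2.1G\n512K\n' | sort -rh    # 2.1G, 980M, 512K
```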

The other three flags worth owning: -k picks which field to sort by (sort -k3 -n sorts on the third whitespace-separated field), -t changes the field separator (sort -t: -k3 -n /etc/passwd sorts users by UID), and -u deduplicates while sorting, equivalent to sort | uniq in one process. sort is also the only tool in this list that has to hold its entire input: it has to see the last line before it can emit the first, which has consequences we will get to.
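A small sketch of those three flags on hypothetical passwd-style records (name:password:uid), so nothing depends on the real /etc/passwd:

```shell
# Hypothetical records; the alice line is deliberately repeated.
records='carol:x:1002
alice:x:1000
bob:x:1001
alice:x:1000'

# -t: sets the separator, -k3 picks the third field, -n compares it as a number.
# (-k3 runs from field 3 to the end of the line; -k3,3 would stop at field 3.)
printf '%s\n' "$records" | sort -t: -k3 -n    # alice, alice, bob, carol

# -u sorts and drops the repeated alice line in one process.
printf '%s\n' "$records" | sort -u            # alice, bob, carol
```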

uniq

uniq collapses adjacent duplicate lines into one. That word "adjacent" is the whole tool. It reads one line at a time and compares it only to the previous line; if they match, it suppresses the repeat. It keeps no memory beyond that single line, which is what makes it fast and stream-friendly, and it is also why uniq on unsorted input silently gives you a wrong answer: equal lines that are not next to each other are never compared, so they all survive. sort exists in front of uniq to push equal lines together so that "adjacent" and "equal" become the same thing.

The flag that earns uniq its place in the pattern below is -c: instead of just collapsing a run of equal lines, prefix the survivor with how long the run was. That single flag turns a deduplicator into a histogram builder. The runner-up is -d, print only the lines that were duplicated, which is a fast way to ask "which entries appear more than once in this list" — duplicate user IDs, repeated hostnames in a config, that sort of thing.
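To watch "adjacent" do its work, feed uniq the same made-up four lines with and without a sort in front:

```shell
# "a" appears three times, but only two of them are next to each other.
printf 'a\nb\na\na\n' | uniq -c           # three rows: 1 a, 1 b, 2 a (runs, not totals)

# After sorting, every "a" is adjacent, so the run length is the total.
printf 'a\nb\na\na\n' | sort | uniq -c    # two rows: 3 a, 1 b

# -d keeps only values that occurred more than once.
printf 'a\nb\na\na\n' | sort | uniq -d    # a
```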

cut, and when to reach for awk instead

cut slices columns out of lines. -d sets the delimiter, -f picks the fields: cut -d: -f1 /etc/passwd is the canonical example, every username on its own line. It is small, fast, and predictable, and for cleanly delimited data like /etc/passwd or a CSV without quoted commas it is the right tool.

Its weakness is that the delimiter is one literal character, counted strictly. Two spaces in a row mean an empty field between them, so cut -d' ' -f1 on log output with aligned columns picks up empty strings where you expected values. The moment your columns are separated by runs of whitespace — which is most command output — use awk instead: awk '{print $1}' splits on any amount of whitespace, ignores leading space, and never surprises you. The rule of thumb: one literal delimiter, cut; whitespace or anything conditional, awk. The full awk story, with sed alongside it, lives in awk & sed.
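A minimal illustration of that weakness, using one made-up line with two spaces between its columns:

```shell
line='alice  42  staff'

# cut counts every single space, so field 2 is the empty string between the two spaces.
printf '%s\n' "$line" | cut -d' ' -f2      # prints an empty line

# awk treats any run of whitespace as one separator.
printf '%s\n' "$line" | awk '{print $2}'   # 42
```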

tr

tr transliterates characters: it reads the stream and replaces every character in one set with the corresponding character in another. tr 'a-z' 'A-Z' upcases, tr ',' '\n' turns a comma-separated value into one item per line, which is often the first move when something hands you a single long line and every other tool in this list wants lines.

Its two distinctive flags are -s (squeeze) and -d (delete). tr -s ' ' collapses every run of spaces into a single space, which rescues cut from the aligned-columns problem above. tr -d '\r' deletes carriage returns, the classic fix for a file that came from Windows and is quietly breaking your string comparisons. Note that tr works on characters, never strings — it cannot replace "foo" with "bar." That is sed's job.
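The three uses above, sketched on made-up inline input:

```shell
# Split one comma-separated line into one item per line.
printf 'red,green,blue\n' | tr ',' '\n'

# Squeeze runs of spaces first, and cut's single-character delimiter works again.
printf 'alice  42  staff\n' | tr -s ' ' | cut -d' ' -f2    # 42

# Strip the carriage return from a Windows-style line ending.
printf 'ok\r\n' | tr -d '\r' | wc -c                       # 3 bytes: o, k, newline
```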

wc

wc counts. -l counts lines, -c counts bytes, -w counts words, and in pipelines you will reach for -l nearly every time, because in this world "how many" almost always means "how many lines." grep ERROR app.log | wc -l is the first pipeline most people ever write, and it remains the fastest way to put a number on a hunch. (When the count is the only thing you need, grep -c ERROR app.log gets there in one process, but counting the output of an arbitrary pipeline is what wc -l is for.)

One habit worth forming: wc -l is the cheapest sanity check available, so use it between stages while you build. Before trusting a five-stage pipeline, run the first two stages into wc -l and confirm the number is the size you expected. A filter that matched nothing, or matched everything, announces itself immediately as a count of zero or a count that did not shrink.

head and tail

head -n 20 keeps the first twenty lines and discards the rest; tail -n 20 keeps the last twenty. At the end of a pipeline, head is how a ranked list becomes a top ten: everything before it produces the full ranking, and head truncates it. It also ends the pipeline early in a literal sense — once head has its lines it exits, and the kernel tears down everything upstream, a mechanism explained in the internals section below.

tail has one more trick that changes its nature entirely: tail -f does not stop at the end of the file. It waits, and prints new lines as they are appended, turning a static file into a live stream. tail -f access.log | grep 502 is a poor man's alerting system during an incident — with one buffering caveat that has its own entry in the pitfalls. For logs that live in the journal rather than in files, the equivalent live view is journalctl -f, covered in journalctl & dmesg.

tee

Every other tool here transforms the stream; tee just watches it. It copies standard input to standard output unchanged, and also writes the same bytes to the file you name — a T-junction in the pipe, hence the name. Stick it between two stages and the pipeline behaves exactly as before, but you get a file containing whatever flowed through that point.

That makes it the debugger for pipelines: when a five-stage pipeline produces nonsense, the question is always "which stage went wrong," and tee /tmp/mid.txt dropped after a suspect stage answers it without dismantling anything. The flag worth knowing is -a, append instead of overwrite, for collecting across runs. The other daily use is capturing while watching: ./deploy.sh 2>&1 | tee deploy.log shows you the output live and keeps a copy for the postmortem.
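A small sketch of both behaviours, the tap and the append. The scratch file comes from mktemp, so the path is an assumption of this example and not a file the page uses elsewhere:

```shell
tap=$(mktemp)    # throwaway file for the tap

# The tap sits between sort and uniq -c; the pipeline's output is what it would be without it.
printf 'b\na\nb\n' | sort | tee "$tap" | uniq -c

# The file holds exactly what flowed past that point: a, b, b.
cat "$tap"

# -a appends to the file instead of overwriting it.
printf 'c\n' | tee -a "$tap" > /dev/null
wc -l < "$tap"   # 4 lines now

rm -f "$tap"
```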

The pattern: the histogram

Most ranked-list questions reduce to one reusable shape. Learn it once and you will type it for the rest of your career:

extract the thing | sort | uniq -c | sort -rn | head

Read it as four verbs. Extract: reduce each input line to just the value you want to count — an IP, a status code, a path — usually with awk or cut. Group: sort brings equal values together so they are adjacent. Count: uniq -c collapses each run into one line prefixed with its length. Rank: sort -rn orders by that count, descending and numeric, and head keeps the top of the ranking. The first sort exists only to serve uniq; the second exists to serve you.

Figure: The histogram pattern, stage by stage. Watch where the data actually shrinks: uniq -c collapses 52,114 rows into 1,204 unique values, and head keeps ten. The two sorts never change the count, only the order.

Here it is three times on the same combined-format access log, where field 1 is the client IP, field 7 is the request path, and field 9 is the status code. Same shape every time; only the extraction changes.

# top client IPs
$ awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5

# status code distribution (few enough values that head is optional)
$ awk '{print $9}' access.log | sort | uniq -c | sort -rn

# top request paths
$ awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -5

And here is the first one with its output, annotated. This is the moment a wall of text becomes a ranked list:

$ awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5
   9123 198.51.100.7      <- count from uniq -c; value from the stream
   1844 203.0.113.42      <- one row per unique IP, biggest first
    977 192.0.2.150
    412 203.0.113.9
    301 198.51.100.23     <- head -5 cut the ranking here

The leading whitespace before each count is uniq -c right-aligning its numbers; sort -rn handles it fine because numeric sort skips leading blanks. One IP with five times the requests of the runner-up is exactly the kind of asymmetry this pattern exists to surface.

Why two sorts. They do different jobs and want different flags. The first sort is lexical and exists purely so equal values are adjacent for uniq; you never look at its output. The second is numeric and descending because now the lines start with counts and you want the biggest one first. Swapping their flags — numeric first, lexical second — is a classic way to get a confidently wrong top ten.

Variations come cheap once the spine is in place. Insert a grep before the extraction to scope the question: only the last hour, only one endpoint, only 5xx responses. Make the extraction smarter to change the question entirely: awk '{print $9, $7}' histograms status-and-path pairs, which tells you not just that 502s spiked but where. The pattern also assumes line-oriented text; when the input is JSON, jq does the extraction step, and the rest of the pipeline stays the same. And when the input is not a file but a set of files scattered across a tree, find & xargs is the stage that comes before stage one.
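Two of those variations, sketched on four made-up log lines written to a scratch file. The grep pattern here is a deliberately crude way to scope to 502 responses and is an assumption of this example; on a real log you may prefer to test field 9 in awk.

```shell
log=$(mktemp)    # throwaway log for this sketch
cat > "$log" <<'EOF'
192.0.2.1 - - [08/Jun/2026:14:00:01 +0000] "GET /api/orders HTTP/1.1" 502 512
192.0.2.2 - - [08/Jun/2026:14:00:02 +0000] "GET /api/users HTTP/1.1" 200 512
192.0.2.1 - - [08/Jun/2026:14:00:03 +0000] "GET /api/orders HTTP/1.1" 502 512
192.0.2.3 - - [08/Jun/2026:14:00:04 +0000] "GET /api/users HTTP/1.1" 502 512
EOF

# Scope first, then histogram: paths of the 502 responses only.
grep '" 502 ' "$log" | awk '{print $7}' | sort | uniq -c | sort -rn

# Change the extraction: count status-and-path pairs instead of one field.
awk '{print $9, $7}' "$log" | sort | uniq -c | sort -rn

rm -f "$log"
```

In both runs /api/orders tops the list with a count of 2; only the question being asked has changed.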

Three production scenarios

Which IP is hammering us?

Latency alarms are firing, the service is shedding load, and the first question in the incident channel is whether the traffic spike is organic or one badly behaved client. Grab a recent window of the access log and run the histogram. tail gives you the window without re-reading a multi-gigabyte file:

$ tail -n 100000 access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -3
  61240 198.51.100.7
   1893 203.0.113.42
   1761 192.0.2.150

Sixty-one thousand of the last hundred thousand requests came from one address. That is not organic, and you now have something concrete to do: rate-limit it, block it at the edge, or look up whose NAT gateway it is before you do either. The follow-up question, what is it actually requesting, is the same pipeline with a grep in front: grep '^198\.51\.100\.7 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head (the dots are escaped because an unescaped . in a grep pattern matches any character). If all sixty-one thousand requests hit one search endpoint, you are probably looking at a scraper; if they spray across everything, more likely a retry storm from a misconfigured client. Thirty seconds of pipeline work, and the incident has a direction.

What changed between these two config dumps?

Two environments are supposed to be identical and are behaving differently. You dump the effective config from each — environment variables, sysctls, feature flags, package lists — and you want the difference. diff on the raw dumps is noisy because the two systems emit keys in different orders, and diff compares positionally. Sort both first and ordering noise vanishes:

$ sort prod.env > /tmp/a && sort staging.env > /tmp/b
$ diff /tmp/a /tmp/b
12c12
< DB_POOL_SIZE=40
---
> DB_POOL_SIZE=10

For set-style questions, comm is the sharper tool. It takes two sorted files and produces three columns: lines only in the first, lines only in the second, lines in both, and its flags suppress columns rather than select them. comm -23 a b is "in a but not b" (suppress columns 2 and 3), comm -13 a b is "in b but not a," and comm -12 a b is the intersection. "Which packages are on prod but missing on staging" is comm -23 on two sorted package lists, one line per missing package, no diff markers to parse. Like uniq, comm trusts you on the sorting — give it unsorted input and it produces garbage, with at most a warning.
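The three column suppressions, sketched on two tiny hypothetical package lists that are already in sorted order:

```shell
a=$(mktemp); b=$(mktemp)             # scratch files for the two lists
printf 'curl\ngit\njq\n' > "$a"      # hypothetical prod packages
printf 'curl\njq\nvim\n' > "$b"      # hypothetical staging packages

comm -23 "$a" "$b"    # only in the first file:  git
comm -13 "$a" "$b"    # only in the second file: vim
comm -12 "$a" "$b"    # in both:                 curl, jq

rm -f "$a" "$b"
```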

How many distinct values are in here?

Cardinality questions come up constantly: how many distinct users hit the API today, how many unique error signatures are in this log, how many different source IPs touched this port. The whole histogram is overkill when you only need the count of unique values:

$ awk '{print $1}' access.log | sort -u | wc -l
1204

sort -u deduplicates, wc -l counts the survivors. On anything that fits on one machine this finishes in seconds — GNU sort handles inputs larger than memory by spilling sorted runs to temporary files and merging them, so even a file bigger than RAM works; it is just slower and needs scratch space (point -T somewhere roomy if /tmp is small). Is the answer exact? Yes. Is the method efficient? Not very — sorting the whole stream to count distinct values is more work than the question strictly needs, and at data-warehouse scale you would reach for an approximate counter instead. But for a few hundred million lines on a box you are already logged in to, "exact answer in under a minute, one line to type" wins. Knowing when the dumb-but-correct method is good enough is most of the skill.

What a pipe actually is

The | looks like syntax, but it is a kernel object. When the shell runs sort | uniq -c, it asks the kernel for a pipe — a small in-kernel byte buffer, 64 KiB by default on Linux — then starts both processes at the same time, with sort's standard output wired to the buffer's write end and uniq's standard input wired to its read end. There is no temporary file and no staging: bytes go from one process's memory into the kernel buffer and out into the other process's memory. You can see pipes sitting in descriptor tables as pipe:[id] entries under /proc, and the pipe's place in the wider family of process-to-process plumbing is covered in interprocess communication.
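One way to see the kernel object for yourself, assuming a Linux system with /proc mounted and readlink available: ask a process on the right-hand side of a pipe what its standard input is.

```shell
# /proc/self/fd/0 is a symlink describing the calling process's stdin.
# On the reading end of a pipe it points at a pipe object, not a file.
echo | readlink /proc/self/fd/0    # prints something like pipe:[123456]
```

The number in brackets identifies the pipe and will differ on every run.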

The buffer being small and fixed is a feature, because it gives you backpressure for free. If the writer outruns the reader, the buffer fills and the writer's next write() simply blocks until the reader drains some bytes. If the reader outruns the writer, its read() blocks until bytes arrive. No polling, no retry loops, no unbounded queue eating memory: the kernel pauses whichever side is ahead. This is why a pipeline's memory use stays flat no matter how much data flows through it — as long as every stage streams.

Not every stage can stream, and the distinction is worth keeping in your head. grep, awk, cut, tr, uniq, and head are all line-at-a-time machines: read a line, transform or discard it, emit, forget. Their memory use is one line, regardless of input size. sort cannot work that way, because the first line of its output might be the last line of its input — it must take in everything before it can emit anything. So a pipeline containing sort has a dam in the middle: everything upstream streams into sort, nothing comes out the other side until upstream finishes, and then the downstream stages get the whole sorted flood. The pipeline still works; it just stops being a flat-memory streaming system at that one stage, which is exactly when sort's spill-to-disk behaviour from the cardinality scenario earns its keep.

Pipes also explain how pipelines end early. When head has printed its ten lines, it exits, closing the read end of the pipe feeding it. The next time the upstream process writes to that pipe, the kernel delivers SIGPIPE, which kills it by default, which closes its input pipe, and the teardown dominoes all the way to the start. This is why sort -rn bigfile | head does not waste time writing millions of lines nobody will read: the consumer's exit propagates backwards and shuts the producers down.
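The smallest demonstration of this uses yes, which prints "y" forever when left alone:

```shell
# head exits after three lines; yes is then stopped on its next write
# to the closed pipe, and the whole pipeline returns.
yes | head -n 3
```

If the teardown did not propagate, this command would never finish.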

One consequence shows up in scripts. A pipeline's exit status is the last command's status, so grep ERROR missing-file.log | wc -l exits 0 — wc succeeded, never mind that grep could not open the file and the 0 it counted means nothing. In bash, set -o pipefail changes the rule: the pipeline's status becomes the rightmost failure, so the broken grep surfaces. The standard prologue for any shell script you care about is set -euo pipefail. One interaction to know about: under pipefail, the SIGPIPE teardown above counts as a failure too — a producer killed mid-write exits with status 141, so somecmd | head can "fail" in a script even though everything worked exactly as designed. When that bites, either handle 141 explicitly or restructure so head is not truncating a still-writing producer.
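The two rules side by side, as a sketch that assumes bash is installed and that /nonexistent/app.log does not exist. Each case runs in its own bash -c so the pipefail setting cannot leak into your current shell:

```shell
# Default rule: the pipeline reports wc's status, so grep's failure is hidden.
bash -c 'grep ERROR /nonexistent/app.log 2>/dev/null | wc -l > /dev/null; echo "default: $?"'
# default: 0

# With pipefail, the rightmost failing stage sets the status.
# grep exits with 2 when it cannot open its input file.
bash -c 'set -o pipefail; grep ERROR /nonexistent/app.log 2>/dev/null | wc -l > /dev/null; echo "pipefail: $?"'
# pipefail: 2
```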

Pitfalls

uniq without sort. The classic, and worth restating as a failure mode: on unsorted input, uniq -c counts runs, not totals, so one IP scattered through the log shows up as dozens of small counts instead of one big one — and your "top ten" ranks whoever happened to cluster, not whoever sent the most. The output looks plausible, which is what makes it dangerous. If the same value appears on more than one line of uniq -c output, the input was not sorted.

Lexical sort on numbers. Plain sort orders 1, 10, 2, 20, 3 because it compares character by character. Anywhere a count, a size, or a duration is being ranked, you want -n (or -h for suffixed sizes). The sneakiest version is forgetting -n on the second sort of the histogram: counts of 9 rank above counts of 1000, the top of the list is merely wrong rather than obviously broken, and nothing errors.

Locale-dependent ordering. sort obeys the locale, and locales like en_US.UTF-8 use collation rules that can ignore case and punctuation in ways that surprise tools downstream — comm and join may declare input "not sorted" that sort itself just produced under a different locale. Prefixing LC_ALL=C forces plain byte order: LC_ALL=C sort file. It is also a real speed win, since byte comparison skips the collation machinery entirely — on big files, several times faster. For pipelines feeding machines rather than humans, LC_ALL=C is the safe default.
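A quick way to see the two orderings on your own machine, using four made-up lines. Only the first result is the same everywhere; the second depends on whatever locale your shell is running under:

```shell
# Byte order: every uppercase letter sorts before every lowercase one.
printf 'b\nA\na\nB\n' | LC_ALL=C sort    # A, B, a, b

# Locale order: varies by system; many UTF-8 locales interleave the cases.
printf 'b\nA\na\nB\n' | sort
```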

The useless cat. cat access.log | grep 502 works, and nobody should be sneered at for typing it — building pipelines left-to-right starting from "show me the file" is a perfectly sane habit. But grep 502 access.log does the same thing with one fewer process and one fewer copy of every byte, and tools that take a filename can sometimes be smarter with one (parallelism, mmap) than with an anonymous stream. Worth knowing mostly so the day someone says "useless use of cat" you can nod instead of googling it mid-meeting.

Buffering surprises on live streams. tail -f access.log | grep 502 | awk '{print $7}' sits silent even as 502s pour in, and the reason is invisible: when stdout is a terminal, tools flush every line, but when stdout is a pipe, libc switches to block buffering and grep holds output in a 4 KiB buffer until it fills. On a trickle of matches, "until it fills" can be minutes. The fix is grep --line-buffered, or generically stdbuf -oL somecmd for tools without such a flag. The rule: in any pipeline that should show results live, every stage except the last needs to be told to flush by line.

A drill you can run right now

Everything below is safe on any machine with a shell: it writes two small throwaway files under /tmp and touches nothing else. Ten minutes, and the histogram stops being a recipe and becomes something you have built from parts.

Step 1 — build a fake access log. Twenty lines, one noisy client baked in. Every even-numbered line comes from 203.0.113.7 failing against /api/orders; the odd lines spread across three well-behaved IPs:

$ for i in $(seq 1 20); do
>   if [ $(( i % 2 )) -eq 0 ]; then ip=203.0.113.7; path=/api/orders; code=500;
>   else ip=10.0.0.$(( i % 3 + 1 )); path=/api/users; code=200; fi
>   echo "$ip - - [08/Jun/2026:14:00:$(printf '%02d' $i) +0000] \"GET $path HTTP/1.1\" $code 512"
> done > /tmp/lab.log
$ wc -l /tmp/lab.log
20 /tmp/lab.log

Look at a couple of lines with head -3 /tmp/lab.log and identify the fields: IP is field 1, path is field 7, status is field 9 — the same combined-format positions as a real nginx or Apache log.

Step 2 — run the three histograms. Top IPs first:

$ awk '{print $1}' /tmp/lab.log | sort | uniq -c | sort -rn
     10 203.0.113.7
      4 10.0.0.2
      3 10.0.0.3
      3 10.0.0.1

The noisy client you planted in step 1 floats to the top, exactly as a real one would. Now run the other two yourself: awk '{print $9}' for the status distribution (expect a clean 10/10 split between 500 and 200) and awk '{print $7}' for paths. Then make the extraction smarter: awk '{print $9, $7}' histograms status-and-path pairs and shows in one view that every 500 is on /api/orders.

Step 3 — break it on purpose. Drop the first sort and run awk '{print $1}' /tmp/lab.log | uniq -c | sort -rn. Because the noisy IP alternates with the others, no two adjacent lines match, and every line gets a count of 1 — the wrong answer from the pitfalls section, produced live and harmlessly. Then try the second sort without -n on this data and note the tie-breaking changes; on bigger data, missing -n is what puts a count of 9 above 1000.

Step 4 — tap the middle with tee. Put a tap between sort and uniq -c and look at what actually flows there:

$ awk '{print $1}' /tmp/lab.log | sort | tee /tmp/mid.txt | uniq -c | sort -rn
     10 203.0.113.7
      4 10.0.0.2
      3 10.0.0.3
      3 10.0.0.1
$ head -4 /tmp/mid.txt
10.0.0.1
10.0.0.1
10.0.0.1
10.0.0.2

The pipeline's output is unchanged, because tee is invisible to the stages around it, but /tmp/mid.txt now holds the intermediate stream: every IP, one per line, sorted so the duplicates sit together in runs. That file is what uniq -c sees, and seeing it makes the whole pattern concrete: sort built the runs, uniq measured them. The next time a five-stage pipeline gives you nonsense at 2am, this is the move: tee after each stage until you find the one whose output stopped matching your mental model. Clean up with rm /tmp/lab.log /tmp/mid.txt or just leave them; together they are about two kilobytes in /tmp.

If you remember one line. ... | sort | uniq -c | sort -rn | head — extract, group, count, rank. The first sort serves uniq, the second serves you, and tee /tmp/mid.txt between any two stages shows you what is really flowing there.
