awk & sed

You have ten thousand lines of log and one question: which endpoint is throwing the 500s? The answer is in there, spread across a column, and writing a Python script for it feels like renting a crane to move a chair. This is the job awk and sed were built for. awk is a tiny programming model that runs once per line; sed is an editor that edits once per line. Learn the two models — not a grab-bag of memorised incantations — and most text questions collapse into one line you can type from memory. This page teaches the models, annotates the one-liners worth keeping, walks three production jobs, and ends with a drill on a fake log you build yourself.

The question they answer

Both tools answer the same recurring question: how do I pull the answer out of this text without writing a script? The text might be an access log, the output of ps, a CSV someone exported from a dashboard, a config file, or whatever journalctl just handed you. The answer might be a sum, a histogram, one column out of nine, or the same file with one string swapped for another. You could open an editor. You could write Python. But during an incident, or in the middle of a pipeline, you want the answer in the time it takes to type a line.

The two tools split the territory. awk is for computing over text: sums, counts, filters on a column's value, grouping by a key. It is a real programming language, but one with an unusual shape — your program runs once for every input line, and the line arrives pre-split into fields. sed is for editing text in a stream: substitute this for that, delete these lines, print only that range. It has no arithmetic worth using and no real variables; what it has is a terse grammar for "on these lines, do this edit."

The trap with both is learning them as a list of tricks. Tricks evaporate under pressure. The models do not. awk has exactly one idea — a sequence of pattern { action } pairs applied to every line — and sed has exactly one idea too — an address that selects lines, and a command that edits them. Every one-liner you will ever see, including all fifteen on this page, is a direct application of one of those two sentences. So the plan here is: model first, one-liners second, and by the end the one-liners should feel like things you could have derived yourself.

The awk model

An awk program is a list of pattern { action } pairs. awk reads input one line at a time (it calls a line a record), and for each line it walks your list: if the pattern matches, the action runs. That is the whole engine. You never write the loop over lines; the loop is the language. Everything else is detail, and there are only five details that matter.

Fields. Before your code sees a line, awk splits it on whitespace. $1 is the first field, $2 the second, $0 the whole unsplit line. NF holds the number of fields on the current line, which makes $NF the last field — handy for paths and log lines where the interesting bit sits at the end. NR counts lines seen so far. The default split is on runs of spaces and tabs, with leading whitespace ignored, which is exactly the behaviour you want for command output where columns are padded to align. For other delimiters there is -F: awk -F, for CSV-ish data, awk -F: for /etc/passwd.
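To see the field variables on something concrete, here is a minimal sketch. The padded line and the passwd-style line are made-up inputs, not taken from any real system.

```shell
# A made-up line of padded command output; the default split ignores the padding.
line='  alice   4021  /usr/bin/python3   running'

fields=$(printf '%s\n' "$line" | awk '{ print NF }')   # how many fields awk found
first=$(printf '%s\n' "$line" | awk '{ print $1 }')    # leading whitespace is skipped
last=$(printf '%s\n' "$line" | awk '{ print $NF }')    # $NF is the last field

# -F changes the separator: field 7 of a passwd-style line is the login shell.
shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{ print $7 }')

printf '%s %s %s %s\n' "$fields" "$first" "$last" "$shell"
# prints: 4 alice running /bin/bash
```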

Patterns are filters. The pattern in front of the action can be a regex (/ERROR/), a comparison ($9 == 500), or any expression (NF > 3, NR % 2). If you omit the pattern, the action runs on every line. If you omit the action, the default action is print — print the matching line. That last rule is why awk '$9 == 500' is a complete program: pattern with no action, so matching lines get printed. It is grep, except the condition is a typed comparison on a column instead of a regex on the whole line.
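The two defaults are easier to trust once you have watched them side by side. A small sketch on three invented request lines (method, path, status), where the status sits in field 3 rather than field 9:

```shell
data='GET /a 200
GET /b 500
POST /c 500'

# Pattern only: the default action prints each matching line whole.
pattern_only=$(printf '%s\n' "$data" | awk '$3 == 500')

# Action only: no pattern, so the action runs on every line.
action_only=$(printf '%s\n' "$data" | awk '{ print $2 }')

# Both: a regex gate in front of an action.
both=$(printf '%s\n' "$data" | awk '/^POST/ { print $2 }')

printf '%s\n---\n%s\n---\n%s\n' "$pattern_only" "$action_only" "$both"
```

The first command prints the two 500 lines intact, the second prints all three paths, and the third prints only /c.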

BEGIN and END. Two special patterns run outside the per-line loop: BEGIN fires once before any input is read (set a separator, print a header) and END fires once after the last line (print the totals you accumulated). Almost every "summarise this file" one-liner is shaped like accumulate per line, report in END.

Associative arrays. awk's only data structure, and the one that earns it a permanent spot in your toolkit. count[$9]++ means "increment the counter stored under whatever key field 9 holds" — no declaration, no initialisation, missing keys spring into existence as zero. for (k in count) walks the keys in END. Group-by, histogram, dedupe, and join all fall out of this one feature.
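Histogram and dedupe appear among the one-liners; the join does not, so here is a hedged sketch of it. The two files and their contents are hypothetical, and the sketch leans on FNR, the per-file line counter, which this page does not otherwise cover: NR == FNR is true only while awk is still reading the first file.

```shell
tmp=$(mktemp -d)
printf '10.0.4.12 web-1\n10.0.9.55 batch-7\n' > "$tmp/hosts.txt"          # ip -> name
printf '10.0.9.55 500\n10.0.4.12 200\n10.0.9.55 500\n' > "$tmp/hits.txt"  # ip, status

# First file: load the lookup table and skip to the next line.
# Second file: translate each IP through the table.
joined=$(awk 'NR == FNR { name[$1] = $2; next } { print name[$1], $2 }' \
  "$tmp/hosts.txt" "$tmp/hits.txt")

printf '%s\n' "$joined"
# prints:
# batch-7 500
# web-1 200
# batch-7 500

rm -rf "$tmp"
```

An IP missing from the lookup file would print an empty name rather than raise an error, which is the permissiveness the next paragraph describes.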

Variables just work. s += $1 with no setup; uninitialised variables are 0 as numbers and "" as strings, and awk converts between the two based on how you use them. This permissiveness is what makes one-liners one line — and occasionally what makes them quietly wrong, which the pitfalls section returns to.
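A few probes make the "quietly wrong" part concrete. This is only a sketch; the variable names and the tiny input are invented for the demonstration.

```shell
# An unset variable is 0 in a numeric context and empty in a string context.
unset_demo=$(awk 'BEGIN { print x + 0; print "[" x "]" }')

# A non-numeric field counts as 0 in a sum, so the header row vanishes without complaint.
total=$(printf 'amount\n10\n5\n' | awk '{ s += $1 } END { print s }')

# Two string constants compare as strings ("10" sorts before "9"); two numbers compare as numbers.
compare=$(awk 'BEGIN { a = ("10" < "9"); b = (10 < 9); print a, b }')

printf '%s\n%s\n%s\n' "$unset_demo" "$total" "$compare"
# prints:
# 0
# []
# 15
# 1 0
```

None of the three raises a warning. That is convenient for the sum and a trap for the comparison.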

The execution model: BEGIN once, then line → fields → pattern gate → action for every line, then END once. You never write the loop.

The ten one-liners

Each of these is the model above wearing different clothes. Sample input for most of them is the common access log format, where field 1 is the client IP, field 7 the request path, field 9 the status code, and field 10 the response size in bytes.

1. Sum a column. The canonical accumulate-then-report shape. No pattern, so the action runs on every line; the total prints once at the end.

$ awk '{ s += $10 } END { print s }' access.log
48211904

2. Mean of a column. Same shape, two accumulators. The if (n) guard keeps an empty input from printing a divide-by-zero error instead of nothing.

$ awk '{ s += $10; n++ } END { if (n) print s / n }' access.log
4821.19

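3. Count by key: the histogram. The associative-array idiom, and the single most useful line on this page. Read it as "tally field 9, then dump the tallies." Swap $9 for any field and you have a group-by on that column. The order in which for (k in count) visits keys is unspecified, so pipe the output through sort when you need it stable.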

$ awk '{ count[$9]++ } END { for (k in count) print k, count[k] }' access.log
200 9212
301 187
404 560
500 41

4. Lines where a column clears a threshold. Pattern with no action, so matches print whole. This is grep with arithmetic: here, every response bigger than half a megabyte. Note the comparison is numeric because both sides look like numbers.

$ awk '$10 > 500000' access.log
10.0.4.12 - - [...] "GET /export/full.csv HTTP/1.1" 200 8123904

5. Extract a column from ragged whitespace. The reason this is an awk job and not a cut job: cut -d' ' treats every single space as a separator, so aligned columns padded with runs of spaces give it indigestion. awk's default splitting collapses the runs.

$ ps aux | awk 'NR > 1 { print $2, $11 }' | head -3   # NR > 1 skips the header row
1 /sbin/init
412 /usr/lib/systemd/systemd-journald
1290 nginx:

6. Dedupe, keeping order. A famous line, worth decoding once so it stops being magic. seen[$0]++ returns the old value before incrementing: zero the first time a line appears, nonzero after. ! flips that, so the pattern is true only on first sight, and the default action prints. Unlike sort -u it needs no sort, so the input order survives.

$ awk '!seen[$0]++' hosts.txt

7. Everything between two patterns. A range pattern: true from the line matching the first regex through the line matching the second, inclusive. Good for extracting one stanza, one certificate, one section of a report.

$ awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/' bundle.pem

8. The line before a match. grep -B1 shows the line before a match, but always with the match attached; printing only the preceding line is awk's party trick. Keep a one-line memory: the second pair saves every line into prev, the first prints the saved line whenever the current one matches. The order of the pairs matters: print first, then overwrite.

$ awk '/Traceback/ { print prev } { prev = $0 }' app.log
2026-06-08 10:14:02 worker=7 handling job 99182

9. Reorder columns. Print rebuilds the line in whatever order you name the fields. With a non-space delimiter, set -F for input and OFS in BEGIN for output, or the result comes out space-joined.

$ awk -F, 'BEGIN { OFS = "," } { print $3, $1, $2 }' export.csv

10. Sum a value per key. The histogram's bigger sibling: instead of counting lines per key, sum a field per key. Bytes per client IP, with the heaviest talkers sorted to the top by piping into sort. Composing single-purpose tools like this is the subject of the next page, The pipeline.

$ awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' access.log | sort -rn | head -3
19821044 10.0.9.55
8123904 10.0.4.12
1042113 10.0.7.30

The pattern behind the patterns. Lines 1, 2, 3, and 10 are all accumulate in the action, report in END. Lines 4, 6, and 7 are all pattern with no action, default print. Two shapes cover seven of the ten. When you face a new text question, ask which shape it is before you reach for the keyboard.

The sed model

sed is a stream editor: it reads a line into a buffer, applies your commands, prints the buffer, repeats. Every sed invocation is built from one grammar: address, then command. The address picks which lines; the command says what to do to them. No address means every line. The commands you will actually use fit on one hand: s substitutes, d deletes, p prints, i and a insert text before or after. An address can be a line number (1, $ for the last line), a regex (/^Port/), or a range joining two of either with a comma.

Every sed invocation parses the same way: an optional address, then a command. Once you can read one, you can read all of them.
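To practise that reading, here are four invocations pulled apart into address and command. The config text is a made-up fragment, and the edits are illustrative rather than recommended settings.

```shell
conf='Port 22
# Port 22 was the default
ListenAddress 0.0.0.0'

# address: /^Port/    command: s/22/2222/   (only lines that start with Port)
edited=$(printf '%s\n' "$conf" | sed '/^Port/s/22/2222/')

# address: 2          command: d            (delete line 2)
dropped=$(printf '%s\n' "$conf" | sed '2d')

# address: 1,2        command: p with -n    (print only lines 1 through 2)
range=$(printf '%s\n' "$conf" | sed -n '1,2p')

# no address          command: s            (every line is a candidate)
all=$(printf '%s\n' "$conf" | sed 's/Port/port/')

printf '%s\n' "$edited"
# prints:
# Port 2222
# # Port 22 was the default
# ListenAddress 0.0.0.0
```

The comment on line 2 survives the first edit because the address /^Port/ never selects it; without the address, both lines would change.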

Five one-liners cover most working days.

1. Substitute, everywhere. The g flag matters: without it, s replaces only the first match on each line, a default that has shipped more half-edited files than any other single character. The delimiter after s is whatever character you pick — when the pattern is full of slashes, use | and skip the backslash thicket.

$ sed 's|http://|https://|g' links.txt

2. Groups and the ampersand. \(…\) captures part of the match for reuse as \1 in the replacement (with -E you get the saner (…) syntax). The bare & stands for the entire match, which is the quickest way to wrap matches without retyping them.

$ sed -E 's/([0-9]+)ms/\1 milliseconds/' timings.txt
$ sed 's/ERROR/[&]/' app.log
2026-06-08 10:14:03 [ERROR] payment gateway timeout

3. sed as grep that edits on the way out. -n turns off the default print; the p flag prints only lines where the substitution fired. The combination extracts and reshapes in one pass — here, pulling just the username out of matching lines. Plain grep can find the lines; it cannot rewrite them as it goes.

$ sed -n 's/.*user=\([^ ]*\).*/\1/p' auth.log
deploy
nilesh
deploy

4. A range of lines. /start/,/stop/ selects from the first line matching one regex through the next line matching the other. Pair it with p to extract a block or d to cut one out — here, removing a managed block from a config before regenerating it.

$ sed -n '/^# BEGIN managed/,/^# END managed/p' zone.conf
$ sed '/^# BEGIN managed/,/^# END managed/d' zone.conf

5. Edit the file in place — with a receipt. -i rewrites the file itself instead of printing to stdout. Give it a suffix and sed keeps the original under that name, which turns a destructive edit into a reversible one. Deleting and inserting lines rides the same grammar: a line-number address plus d, i, or a.

$ sed -i.bak 's/db-old\.internal/db-new.internal/g' app.conf
$ sed '1d' data.csv            # drop the header row
$ sed '1i\hostname,role,region' bare.csv   # insert a header above line 1

Both tools speak regex, and the dialect details (basic vs extended, what needs escaping where) are collected in the regex cheat sheet so this page does not have to relitigate them.

Three production scenarios

Log triage: which endpoints are throwing 5xx

An alert fires: elevated 500s on the public API. The access log has a few hundred thousand lines from the last hour and you want the failing endpoints ranked, not a scrolling wall of matches. One awk histogram with a regex pattern on the status field, sorted:

$ awk '$9 ~ /^5/ { c[$7]++ } END { for (u in c) print c[u], u }' access.log | sort -rn | head -5
389 /api/orders
41 /api/payments/charge
12 /api/orders/export
3 /healthz
1 /api/users

Read the pattern: $9 ~ /^5/ is "field 9 matches a regex anchored to start with 5," so 500, 502, 503 all count and 250-byte response sizes in some other column do not. The action tallies by request path; END dumps the tallies; sort -rn ranks them. Thirty seconds after the alert you know the bleeding is concentrated in /api/orders, which is a much better page to wake someone with than "errors are up." The same shape works on anything line-oriented, including output piped straight from journalctl.

Config surgery across forty files

The database moves and db-old.internal must become db-new.internal in every file under conf/ that mentions it. Opening forty files in an editor is how typos are born. The disciplined version is three steps: find the files, edit them in place, review the diff before anything ships.

$ grep -rl 'db-old\.internal' conf/ | wc -l
40
$ grep -rl 'db-old\.internal' conf/ | xargs sed -i.bak 's/db-old\.internal/db-new.internal/g'
$ git diff --stat | tail -1
40 files changed, 67 insertions(+), 67 deletions(-)

Two safety nets, deliberately redundant. The -i.bak suffix leaves a copy of every original next to it, and git holds the authoritative before-state, so git diff shows you exactly what the regex did — including the place where it matched something you did not intend, which you will catch now rather than in staging. Read the whole diff, not the stat line. When it looks right, delete the .bak files and commit. Escaping the dot in the pattern is not pedantry: unescaped, . matches any character, and db-oldXinternal in a comment would be silently rewritten too. The grep -rl | xargs half of this move, including what to do about filenames with spaces, gets full treatment in find & xargs.
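If you want to rehearse the move before pointing it at a real tree, the loop can be sketched on throwaway files. Everything here is hypothetical: the directory, the file names (one with a space on purpose), and the use of diff against the .bak copies in place of git. The --null and -0 pair is one way to survive that space; it assumes your grep and xargs support those flags.

```shell
dir=$(mktemp -d)
mkdir "$dir/conf"
printf 'host=db-old.internal\nport=5432\n' > "$dir/conf/app.conf"
printf '# see db-oldXinternal\nhost=db-old.internal\n' > "$dir/conf/my notes.conf"

# Find the files (NUL-terminated names), then edit in place, keeping *.bak copies.
grep -rl --null 'db-old\.internal' "$dir/conf" |
  xargs -0 sed -i.bak 's/db-old\.internal/db-new.internal/g'

# Review: diff each backup against its edited file (diff exits 1 on a difference).
for bak in "$dir"/conf/*.bak; do
  diff "$bak" "${bak%.bak}" || true
done

changed=$(grep -rl 'db-new\.internal' "$dir/conf" | wc -l | tr -d ' ')
comment=$(head -1 "$dir/conf/my notes.conf")
rm -rf "$dir"
```

Both files end up edited, and the comment line containing db-oldXinternal is left alone because the dot in the pattern was escaped.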

A CSV into a quick report

Someone hands you sales.csv — date, region, sku, amount — and asks for revenue by region "real quick." This does not deserve a spreadsheet and certainly not a notebook. It is one group-by with a formatted footer:

$ head -2 sales.csv
date,region,sku,amount
2026-06-01,emea,SKU-114,1899.00
$ awk -F, 'NR > 1 { rev[$2] += $4; n[$2]++ } END { for (r in rev) printf "%-8s %12.2f %6d\n", r, rev[r], n[r] }' sales.csv | sort -k2 -rn
apac        812094.50   1204
emea        644210.00    981
amer        598332.25   1377

Three model pieces in one line: -F, sets the delimiter, the NR > 1 pattern skips the header row, and two arrays accumulate sum and count per region. printf aligns the output into columns a human can scan, and sort -k2 -rn orders by the revenue column. One honest caveat before you ship this anywhere durable: -F, splits on every comma, including commas inside quoted fields like "Portland, OR". For a known-clean export it is fine; for arbitrary CSV it is a trap, and the next section is about exactly that line.
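The quoted-comma trap takes one line to reproduce. The row below is invented, reusing the column layout of sales.csv, and the last command is a cheap sanity check rather than a fix: it only counts rows whose field count differs from the four the header promises.

```shell
row='2026-06-01,"Portland, OR",SKU-114,1899.00'

nf=$(printf '%s\n' "$row" | awk -F, '{ print NF }')       # 5 fields, not 4
amount=$(printf '%s\n' "$row" | awk -F, '{ print $4 }')   # the sku, not the amount

# Count rows that do not have exactly 4 comma-separated fields.
bad=$(printf '%s\n' 'date,region,sku,amount' "$row" |
  awk -F, 'NF != 4 { bad++ } END { print bad + 0 }')

printf '%s %s %s\n' "$nf" "$amount" "$bad"
# prints: 5 SKU-114 1
```

A nonzero count is the signal to stop and reach for a real CSV parser.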

When not to use them

The honest boundary: awk and sed are line tools, and they reward you exactly as long as the data is truly line-shaped and the state you carry is small. Structured formats fail the first test. JSON is not lines — the same object can arrive pretty-printed across twenty lines or minified onto one, keys come in any order, and strings may contain the very brackets and commas your regex keys on. A sed pattern that extracts a JSON field works until the day the producer reorders keys, and then it fails silently, which is the worst way to fail. Use jq, which parses the structure instead of pattern-matching its shadow. The same logic sends YAML to yq and real CSV (quoted fields, embedded commas) to a parser that knows the quoting rules.

The second test is state. One variable of lookback, like the print-the-previous-line trick, is fine. The moment you are juggling a flag and a counter and a buffer across a multi-line record — matching opening and closing braces, stitching continuation lines, correlating a request line with a response line forty lines later — the one-liner has become a program that happens to be unreadable. Write the twenty lines of Python. It will have a name, live in version control, survive a reviewer, and still work when the requirement grows a second clause, which it will. The rule of thumb that has aged well: if the awk program needs a second pattern-action pair and a variable you had to think about, it is allowed; at three of either, it wants to be a file with a name.

Pitfalls

GNU sed vs BSD sed, the classic. On Linux (GNU sed), sed -i 's/a/b/' f edits in place with no backup. On macOS and the BSDs, -i requires a suffix argument, so the same command eats s/a/b/ as the suffix and then fails — or worse, half-works confusingly. The portable habit is to always pass a suffix (-i.bak, no space, works on both) and delete the backups after review. If you genuinely need no backup on macOS, it is sed -i '' 's/a/b/' f with an explicit empty string. Every engineer hits this once; now you have hit it here instead of in a deploy script.

Assuming the default field split when the data is delimited. awk's whitespace splitting is right for command output and wrong for anything with a real delimiter. /etc/passwd without -F: gives you one field per line. Worse is the half-failure: a tab-separated file where most values contain no spaces mostly works, until one value does and your column numbering shifts for just those rows. If the data has a delimiter, say so with -F, every time.
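Both failures are easy to reproduce. A sketch with a passwd-style line and an invented tab-separated row whose second value contains a space:

```shell
passwd_line='root:x:0:0:root:/root:/bin/bash'
default_nf=$(printf '%s\n' "$passwd_line" | awk '{ print NF }')     # 1: no whitespace to split on
colon_nf=$(printf '%s\n' "$passwd_line" | awk -F: '{ print NF }')   # 7: the real columns

# Three tab-separated columns: host, city, count.
tsv=$(printf 'web-1\tNew York\t42\n')
wrong=$(printf '%s\n' "$tsv" | awk '{ print $3 }')          # the space in the city shifts the columns
right=$(printf '%s\n' "$tsv" | awk -F'\t' '{ print $3 }')   # the tab separator restores them

printf '%s %s %s %s\n' "$default_nf" "$colon_nf" "$wrong" "$right"
# prints: 1 7 York 42
```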

Locale surprises. Regex ranges and sort order both consult the locale. In some locales [a-z] can match uppercase letters too, and sort orders case-insensitively, so a pipeline that worked on your laptop misbehaves on a box with a different LANG. For predictable byte-wise behaviour in scripts, prefix the command with LC_ALL=C — it is also noticeably faster on large files, since the collation logic gets out of the way.
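What LC_ALL=C buys you can be checked on four letters. This sketch shows only the byte-wise side, since the output under other locales depends on what the machine has installed.

```shell
words=$(printf 'b\nA\na\nB\n')

# Byte-wise sort: every uppercase letter comes before every lowercase one.
bytewise=$(printf '%s\n' "$words" | LC_ALL=C sort | tr '\n' ' ')

# Byte-wise range: [a-z] does not match an uppercase letter.
upper_hits=$(printf 'B\n' | LC_ALL=C awk '/[a-z]/ { n++ } END { print n + 0 }')

printf '%s| %s\n' "$bytewise" "$upper_hits"
# prints: A B a b | 0
```

Run the sort again without the prefix on a machine with a non-C LANG and compare. If the order changes, your scripts have been depending on the locale.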

Greedy matching grabs more than you meant. Both tools use regexes where .* takes the longest possible match, and there is no .*? non-greedy escape hatch as in Python or Perl. s/<.*>// on a line with two tags deletes from the first < to the last >, taking the text between the tags with it. The idiomatic fix is a negated class that cannot run past the closer: s/<[^>]*>//g. When a substitution removes more than expected, greed is the first suspect.
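The two-tag case, run both ways on an invented line:

```shell
line='<b>bold</b> and <i>italic</i> text'

# Greedy: .* runs from the first < to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# Bounded: [^>]* cannot cross a closing >, so each tag is matched separately.
bounded=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '[%s]\n[%s]\n' "$greedy" "$bounded"
# prints:
# [ text]
# [bold and italic text]
```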

Editing the file you are reading from. awk '...' data.txt > data.txt destroys the file: the shell truncates data.txt for the redirect before awk reads a byte, so awk opens an empty file and the original is gone. Write to a temp file and move it over, or use sed -i, which does that dance internally (a detail with a side effect worth knowing: the file is replaced, so its inode changes and hard links break). GNU awk offers -i inplace, but the temp-file-and-mv habit is the one that works everywhere.
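The temp-file-and-mv habit, as a minimal sketch on a throwaway file. The file contents are invented, and mktemp is assumed to be available, as it is on most current systems.

```shell
data=$(mktemp)
printf 'keep 1\ndrop 2\nkeep 3\n' > "$data"

tmp=$(mktemp)
# Write the result somewhere else first; replace the original only if awk succeeded.
awk '$1 == "keep"' "$data" > "$tmp" && mv "$tmp" "$data"

result=$(cat "$data")
printf '%s\n' "$result"
# prints:
# keep 1
# keep 3

rm -f "$data"
```

The && matters: if awk fails, the original is left untouched. In real use, create the temp file next to the target so the mv stays on one filesystem, and remember that the replacement takes on the temp file's permissions.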

A drill you can run right now

Everything below happens in /tmp on three lines of fake data, so it is safe on any machine including a shared one. Ten minutes, and the histogram, the sum, and the substitute-with-backup move stop being things you read about today.

Step 1 — build the log. Three echoes, three lines in the access-log shape used throughout this page:

$ echo '10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 5120' > /tmp/drill.log
$ echo '10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /api/orders HTTP/1.1" 500 312' >> /tmp/drill.log
$ echo '10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /api/orders HTTP/1.1" 200 2048' >> /tmp/drill.log

Before running anything, predict the field numbers. Count the whitespace-separated pieces of the first line and convince yourself the status code is $9 and the byte count is $10. (The quoted request is three fields to awk, because awk knows nothing about quotes — another small honest limit.)
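If counting by eye feels error-prone, awk can number the fields for you. A small helper sketch, run here on the first drill line:

```shell
line='10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 5120'

# Print each field on its own line, prefixed with its index.
numbered=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print i, $i }')

printf '%s\n' "$numbered"
# prints:
# 1 10.0.4.12
# 2 -
# 3 -
# 4 [08/Jun/2026:10:00:01
# 5 +0000]
# 6 "GET
# 7 /api/users
# 8 HTTP/1.1"
# 9 200
# 10 5120
```

The timestamp splits into two fields and the quoted request into three, which is why the status lands on 9 rather than 6.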

Step 2 — the histogram. Tally the status codes and check the output against what you can see by eye in three lines:

$ awk '{ c[$9]++ } END { for (k in c) print k, c[k] }' /tmp/drill.log
200 2
500 1

Step 3 — the sum. Total the response bytes, and verify the arithmetic yourself: 5120 + 312 + 2048.

$ awk '{ s += $10 } END { print s }' /tmp/drill.log
7480

Step 4 — the substitution, with a receipt. Rewrite the API prefix in place, keep a backup, and diff the two to see exactly what changed:

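$ sed -i.bak 's|/api/|/v2/api/|' /tmp/drill.log
$ diff /tmp/drill.log.bak /tmp/drill.log
1,3c1,3
< 10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 5120
< 10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /api/orders HTTP/1.1" 500 312
< 10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /api/orders HTTP/1.1" 200 2048
---
> 10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /v2/api/users HTTP/1.1" 200 5120
> 10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /v2/api/orders HTTP/1.1" 500 312
> 10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /v2/api/orders HTTP/1.1" 200 2048
$ rm /tmp/drill.log /tmp/drill.log.bak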

Walk back through what you used: the histogram and the sum were both accumulate, then report in END; the substitution was address (every line), command (s), with an alternate delimiter because the pattern was full of slashes; and the .bak plus diff pair is the same review-before-trust habit that scales up to the forty-file config surgery. Three lines of fake log, but the moves are identical at three hundred thousand.

If you remember one line each. For awk, the histogram: awk '{ c[$KEY]++ } END { for (k in c) print k, c[k] }' — half of all log questions are this line with a different field number. For sed, the safe edit: sed -i.bak 's/old/new/g' file, then read the diff before you delete the backup.

Further reading

24 — The pipeline