awk & sed
You have ten thousand lines of log and one question: which endpoint is throwing the 500s?
The answer is in there, spread across a column, and writing a Python script for it feels
like renting a crane to move a chair. This is the job awk and sed
were built for. awk is a tiny programming model that runs once per line;
sed is an editor that edits once per line. Learn the two models — not a
grab-bag of memorised incantations — and most text questions collapse into one line you
can type from memory. This page teaches the models, annotates the one-liners worth
keeping, walks through three production jobs, and ends with a drill on a fake log you build yourself.
The question they answer
Both tools answer the same recurring question: how do I pull the answer out of this text
without writing a script? The text might be an access log, the output of ps,
a CSV someone exported from a dashboard, a config file, or whatever
journalctl just handed you. The
answer might be a sum, a histogram, one column out of nine, or the same file with one
string swapped for another. You could open an editor. You could write Python. But during
an incident, or in the middle of a pipeline, you want the answer in the time it takes to
type a line.
The two tools split the territory. awk is for computing over text:
sums, counts, filters on a column's value, grouping by a key. It is a real programming
language, but one with an unusual shape — your program runs once for every input line,
and the line arrives pre-split into fields. sed is for editing text
in a stream: substitute this for that, delete these lines, print only that range. It has
no arithmetic worth using and no real variables; what it has is a terse grammar for
"on these lines, do this edit."
The trap with both is learning them as a list of tricks. Tricks evaporate under pressure.
The models do not. awk has exactly one idea — a sequence of
pattern { action } pairs applied to every line — and
sed has exactly one idea too — an address that selects lines, and a command
that edits them. Every one-liner you will ever see, including all fifteen on this page, is
a direct application of one of those two sentences. So the plan here is: model first,
one-liners second, and by the end the one-liners should feel like things you could have
derived yourself.
The awk model
An awk program is a list of pattern { action } pairs. awk reads
input one line at a time (it calls a line a record), and for each line it walks
your list: if the pattern matches, the action runs. That is the whole engine. You never
write the loop over lines; the loop is the language. Everything else is detail, and there
are only five details that matter.
Fields. Before your code sees a line, awk splits it on whitespace.
$1 is the first field, $2 the second, $0 the whole
unsplit line. NF holds the number of fields on the current line, which makes
$NF the last field — handy for paths and log lines where the interesting bit
sits at the end. NR counts lines seen so far. The default split is on
runs of spaces and tabs, with leading whitespace ignored, which is exactly the
behaviour you want for command output where columns are padded to align. For other
delimiters there is -F: awk -F, for CSV-ish data,
awk -F: for /etc/passwd.
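To make the field variables concrete, here is a minimal sketch on two made-up lines. The sample text, the passwd-style entry, and the variable name fields_demo are invented for illustration.

```shell
# Two sample lines with ragged spacing; the content is invented for illustration.
fields_demo=$(mktemp)
printf '%s\n' '  alice   staff  /home/alice' 'bob admin wheel /home/bob' > "$fields_demo"

# NR = line number, NF = field count, $1 = first field, $NF = last field.
awk '{ print NR, NF, $1, $NF }' "$fields_demo"
# 1 3 alice /home/alice
# 2 4 bob /home/bob

# -F: switches the separator, so this line splits on colons instead.
echo 'deploy:x:1001:1001::/home/deploy:/bin/bash' | awk -F: '{ print $1, $NF }'
# deploy /bin/bash
```

The leading spaces on the first line and the uneven gaps between columns make no difference, which is the default split doing its job.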
Patterns are filters. The pattern in front of the action can be a regex
(/ERROR/), a comparison ($9 == 500), or any expression
(NF > 3, NR % 2). If you omit the pattern, the action runs
on every line. If you omit the action, the default action is print — print
the matching line. That last rule is why awk '$9 == 500' is a complete
program: pattern with no action, so matching lines get printed. It is grep, except the
condition is a typed comparison on a column instead of a regex on the whole line.
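The three shapes (pattern only, action only, both) can be tried side by side. This is a small sketch on three invented lines of the form method, path, status; the file name patterns_demo is hypothetical.

```shell
# Three invented lines: method, path, status.
patterns_demo=$(mktemp)
printf '%s\n' 'GET /a 200' 'GET /b 500' 'POST /c 500' > "$patterns_demo"

# Pattern only: the default action prints the whole matching line.
awk '$3 == 500' "$patterns_demo"
# GET /b 500
# POST /c 500

# Action only: runs on every line.
awk '{ print $2 }' "$patterns_demo"
# /a
# /b
# /c

# Both: a regex on the whole line AND a comparison on a field, then an explicit action.
awk '/^GET/ && $3 >= 500 { print $2 }' "$patterns_demo"
# /b
```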
BEGIN and END. Two special patterns run outside the per-line loop:
BEGIN fires once before any input is read (set a separator, print a header)
and END fires once after the last line (print the totals you accumulated).
Almost every "summarise this file" one-liner is shaped like
accumulate per line, report in END.
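A short sketch of that shape, with all three phases visible in the output. The three input numbers and the header text are invented.

```shell
# BEGIN prints a header before any input is read; the main action runs per line;
# END reports once after the last line. The three numbers are invented.
printf '%s\n' 120 80 300 |
  awk 'BEGIN { print "bytes" } { total += $1; print $1 } END { print "total", total, "over", NR, "lines" }'
# bytes
# 120
# 80
# 300
# total 500 over 3 lines
```

NR still holds the count of lines read when END runs, which is why the footer can report it without a separate counter.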
Associative arrays. awk's only data structure, and the one that earns it
a permanent spot in your toolkit. count[$9]++ means "increment the counter
stored under whatever key field 9 holds" — no declaration, no initialisation, missing
keys spring into existence as zero. for (k in count) walks the keys in END.
Group-by, histogram, dedupe, and join all fall out of this one feature.
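Of the four uses named above, join is the least obvious, so here is one possible sketch of it. It relies on the common NR == FNR idiom (true only while the first file is being read); the two files, their contents, and the team names are invented for illustration.

```shell
# Two invented files: path -> owning team, and path -> status seen in a log.
owners=$(mktemp); hits=$(mktemp)
printf '%s\n' '/api/orders checkout-team' '/api/users identity-team' > "$owners"
printf '%s\n' '/api/orders 500' '/api/users 200' '/api/orders 502' > "$hits"

# While reading the first file (NR == FNR), load path -> owner into an array.
# For the second file, look each path up and print the line with its owner.
awk 'NR == FNR { owner[$1] = $2; next } { print $1, $2, owner[$1] }' "$owners" "$hits"
# /api/orders 500 checkout-team
# /api/users 200 identity-team
# /api/orders 502 checkout-team
```

The array is the whole join: the first file becomes a lookup table, the second streams past it.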
Variables just work. s += $1 with no setup; uninitialised
variables are 0 as numbers and "" as strings, and awk converts
between the two based on how you use them. This permissiveness is what makes one-liners
one line — and occasionally what makes them quietly wrong, which the pitfalls section
returns to.
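Both halves of that permissiveness fit in a few lines. This is a sketch with invented input, where the word amount stands in for a header row and - for a missing value.

```shell
# An unset variable is 0 in arithmetic and "" in string context.
awk 'BEGIN { print n + 1; print "[" n "]" }'
# 1
# []

# The quiet failure: a non-numeric field counts as 0, so a header row
# (or a "-" placeholder) vanishes into the sum without any warning.
printf '%s\n' 'amount' '10' '-' '5' | awk '{ s += $1 } END { print s }'
# 15
```

The sum is right only because the junk rows happened to contribute zero; a count or a mean over the same input would silently include them.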
The ten one-liners
Each of these is the model above wearing different clothes. Sample input for most of them is the common access log format, where field 1 is the client IP, field 7 the request path, field 9 the status code, and field 10 the response size in bytes.
1. Sum a column. The canonical accumulate-then-report shape. No pattern, so the action runs on every line; the total prints once at the end.
$ awk '{ s += $10 } END { print s }' access.log
48211904
2. Mean of a column. Same shape, two accumulators. The if (n)
guard keeps an empty input from printing a divide-by-zero error instead of nothing.
$ awk '{ s += $10; n++ } END { if (n) print s / n }' access.log
4821.19
3. Count by key — the histogram. The associative-array idiom, and the
single most useful line on this page. Read it as "tally field 9, then dump the tallies."
Swap $9 for any field and you have a group-by on that column.
$ awk '{ count[$9]++ } END { for (k in count) print k, count[k] }' access.log
200 9212
301 187
404 560
500 41
4. Lines where a column clears a threshold. Pattern with no action, so matches print whole. This is grep with arithmetic: here, every response bigger than half a megabyte. Note the comparison is numeric because both sides look like numbers.
$ awk '$10 > 500000' access.log
10.0.4.12 - - [...] "GET /export/full.csv HTTP/1.1" 200 8123904
5. Extract a column from ragged whitespace. The reason this is an awk
job and not a cut job: cut -d' ' treats every single space as a
separator, so aligned columns padded with runs of spaces give it indigestion. awk's
default splitting collapses the runs.
$ ps aux | awk 'NR > 1 { print $2, $11 }' | head -3   # NR > 1 skips the ps header row
1 /sbin/init
412 /usr/lib/systemd/systemd-journald
1290 nginx:
6. Dedupe, keeping order. A famous line, worth decoding once so it stops
being magic. seen[$0]++ returns the old value before incrementing:
zero the first time a line appears, nonzero after. ! flips that, so the
pattern is true only on first sight, and the default action prints. Unlike
sort -u it needs no sort, so the input order survives.
$ awk '!seen[$0]++' hosts.txt
7. Everything between two patterns. A range pattern: true from the line matching the first regex through the line matching the second, inclusive. Good for extracting one stanza, one certificate, one section of a report.
$ awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/' bundle.pem
8. The line before a match. grep -B1 prints the preceding line together with the
match; printing only the preceding line is awk's party trick. Keep a one-line memory: the second pair
saves every line into prev, the first prints the saved line whenever the
current one matches. The order of the pairs matters: print first, then overwrite.
$ awk '/Traceback/ { print prev } { prev = $0 }' app.log
2026-06-08 10:14:02 worker=7 handling job 99182
9. Reorder columns. Print rebuilds the line in whatever order you name
the fields. With a non-space delimiter, set -F for input and OFS
in BEGIN for output, or the result comes out space-joined.
$ awk -F, 'BEGIN { OFS = "," } { print $3, $1, $2 }' export.csv
10. Sum a value per key. The histogram's bigger sibling: instead of
counting lines per key, sum a field per key. Bytes per client IP, with the heaviest
talkers sorted to the top by piping into sort — composing single-purpose
tools like this is the subject of the
pipeline.
$ awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' access.log | sort -rn | head -3
19821044 10.0.9.55
8123904 10.0.4.12
1042113 10.0.7.30
The sed model
sed is a stream editor: it reads a line into a buffer, applies your commands, prints the
buffer, repeats. Every sed invocation is built from one grammar:
address, then command. The address picks which lines; the command says what to
do to them. No address means every line. The commands you will actually use fit on one
hand: s substitutes, d deletes, p prints,
i and a insert text before or after. An address can be a line
number (1, $ for the last line), a regex
(/^Port/), or a range joining two of either with a comma.
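The address forms can be tried on a four-line file before any of the one-liners. This is a sketch only; the config content and the name sshd_demo are invented.

```shell
# A four-line invented config to address in different ways.
sshd_demo=$(mktemp)
printf '%s\n' '# sshd settings' 'Port 22' 'PermitRootLogin no' '# end' > "$sshd_demo"

# Line-number address: print line 2 only.
sed -n '2p' "$sshd_demo"
# Port 22

# $ address: delete the last line (tail shows what is now last).
sed '$d' "$sshd_demo" | tail -1
# PermitRootLogin no

# Range: from the line matching one regex through the line matching the other.
sed -n '/^Port/,/^Permit/p' "$sshd_demo"
# Port 22
# PermitRootLogin no

# Regex address plus s: the substitution runs only on matching lines.
sed '/^Port/s/22/2222/' "$sshd_demo" | sed -n '2p'
# Port 2222
```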
Five one-liners cover most working days.
1. Substitute, everywhere. The g flag matters: without it,
s replaces only the first match on each line, a default that has shipped
more half-edited files than any other single character. The delimiter after
s is whatever character you pick — when the pattern is full of slashes, use
| and skip the backslash thicket.
$ sed 's|http://|https://|g' links.txt
2. Groups and the ampersand. \(…\) captures part of the
match for reuse as \1 in the replacement (with -E you get the
saner (…) syntax). The bare & stands for the entire match —
the quickest way to wrap matches without retyping them.
$ sed -E 's/([0-9]+)ms/\1 milliseconds/' timings.txt
$ sed 's/ERROR/[&]/' app.log
2026-06-08 10:14:03 [ERROR] payment gateway timeout
3. sed as grep that edits on the way out. -n turns off the
default print; the p flag prints only lines where the substitution fired.
The combination extracts and reshapes in one pass — here, pulling just the username out
of matching lines. Plain grep can find the lines; it cannot rewrite them as it goes.
$ sed -n 's/.*user=\([^ ]*\).*/\1/p' auth.log
deploy
nilesh
deploy
4. A range of lines. /start/,/stop/ selects from the first
line matching one regex through the next line matching the other. Pair it with
p to extract a block or d to cut one out — here, removing a
managed block from a config before regenerating it.
$ sed -n '/^# BEGIN managed/,/^# END managed/p' zone.conf
$ sed '/^# BEGIN managed/,/^# END managed/d' zone.conf
5. Edit the file in place — with a receipt. -i rewrites the
file itself instead of printing to stdout. Give it a suffix and sed keeps the original
under that name, which turns a destructive edit into a reversible one. Deleting and
inserting lines rides the same grammar: a line-number address plus d,
i, or a.
$ sed -i.bak 's/db-old\.internal/db-new.internal/g' app.conf
$ sed '1d' data.csv                       # drop the header row
$ sed '1i\hostname,role,region' bare.csv  # insert a header above line 1
Both tools speak regex, and the dialect details (basic vs extended, what needs escaping where) are collected in the regex cheat sheet so this page does not have to relitigate them.
Three production scenarios
Log triage: which endpoints are throwing 5xx
An alert fires: elevated 500s on the public API. The access log has a few hundred thousand lines from the last hour and you want the failing endpoints ranked, not a scrolling wall of matches. One awk histogram with a regex pattern on the status field, sorted:
$ awk '$9 ~ /^5/ { c[$7]++ } END { for (u in c) print c[u], u }' access.log | sort -rn | head -5
389 /api/orders
41 /api/payments/charge
12 /api/orders/export
3 /healthz
1 /api/users
Read the pattern: $9 ~ /^5/ is "field 9 matches a regex anchored to start
with 5," so 500, 502, 503 all count, while a 5 elsewhere on the line, such as a 512-byte
response size in field 10, does not. The action tallies by request path; END dumps the tallies; sort -rn
ranks them. Thirty seconds after the alert you know the bleeding is concentrated in
/api/orders, which is a much better page to wake someone with than "errors
are up." The same shape works on anything line-oriented, including output piped straight
from journalctl.
Config surgery across forty files
The database moves and db-old.internal must become
db-new.internal in every file under conf/ that mentions it.
Opening forty files in an editor is how typos are born. The disciplined version is three
steps: find the files, edit them in place, review the diff before anything ships.
$ grep -rl 'db-old\.internal' conf/ | wc -l
40
$ grep -rl 'db-old\.internal' conf/ | xargs sed -i.bak 's/db-old\.internal/db-new.internal/g'
$ git diff --stat | tail -1
40 files changed, 67 insertions(+), 67 deletions(-)
Two safety nets, deliberately redundant. The -i.bak suffix leaves a copy of
every original next to it, and git holds the authoritative before-state, so
git diff shows you exactly what the regex did — including the place where it
matched something you did not intend, which you will catch now rather than in
staging. Read the whole diff, not the stat line. When it looks right, delete the
.bak files and commit. Escaping the dot in the pattern is not pedantry:
unescaped, . matches any character, and db-oldXinternal in a
comment would be silently rewritten too. The grep -rl | xargs half of this
move, including what to do about filenames with spaces, gets full treatment in
find & xargs.
A CSV into a quick report
Someone hands you sales.csv — date, region, sku, amount — and asks for
revenue by region "real quick." This does not deserve a spreadsheet and certainly not a
notebook. It is one group-by with a formatted footer:
$ head -2 sales.csv
date,region,sku,amount
2026-06-01,emea,SKU-114,1899.00
$ awk -F, 'NR > 1 { rev[$2] += $4; n[$2]++ } END { for (r in rev) printf "%-8s %12.2f %6d\n", r, rev[r], n[r] }' sales.csv | sort -k2 -rn
apac        812094.50   1204
emea        644210.00    981
amer        598332.25   1377
Three model pieces in one line: -F, sets the delimiter, the
NR > 1 pattern skips the header row, and two arrays accumulate sum and
count per region. printf aligns the output into columns a human can scan,
and sort -k2 -rn orders by the revenue column. One honest caveat before you
ship this anywhere durable: -F, splits on every comma, including commas
inside quoted fields like "Portland, OR". For a known-clean export it is
fine; for arbitrary CSV it is a trap, and the next section is about exactly that line.
When not to use them
The honest boundary: awk and sed are line tools, and they reward you exactly as long as
the data is truly line-shaped and the state you carry is small. Structured formats fail
the first test. JSON is not lines — the same object can arrive pretty-printed across
twenty lines or minified onto one, keys come in any order, and strings may contain the
very brackets and commas your regex keys on. A sed pattern that extracts a JSON field
works until the day the producer reorders keys, and then it fails silently, which is the
worst way to fail. Use jq, which parses the
structure instead of pattern-matching its shadow. The same logic sends YAML to
yq and real CSV (quoted fields, embedded commas) to a parser that knows the
quoting rules.
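The quoted-comma failure is easy to see on a single row. A minimal sketch, using an invented row built around the "Portland, OR" value mentioned earlier:

```shell
# One invented CSV row with a comma inside a quoted field.
row='2026-06-01,"Portland, OR",SKU-114,1899.00'

# A human counts four columns; -F, counts five, and the amount is no longer $4.
echo "$row" | awk -F, '{ print NF; print $4 }'
# 5
# SKU-114
```

Nothing errors out. The row simply reports the SKU where the amount should be, which is the silent kind of wrong this section is warning about.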
The second test is state. One variable of lookback, like the print-the-previous-line trick, is fine. The moment you are juggling a flag and a counter and a buffer across a multi-line record — matching opening and closing braces, stitching continuation lines, correlating a request line with a response line forty lines later — the one-liner has become a program that happens to be unreadable. Write the twenty lines of Python. It will have a name, live in version control, survive a reviewer, and still work when the requirement grows a second clause, which it will. The rule of thumb that has aged well: if the awk program needs a second pattern-action pair and a variable you had to think about, it is allowed; at three of either, it wants to be a file with a name.
Pitfalls
GNU sed vs BSD sed, the classic. On Linux (GNU sed),
sed -i 's/a/b/' f edits in place with no backup. On macOS and the BSDs,
-i requires a suffix argument, so the same command eats
s/a/b/ as the suffix and then fails — or worse, half-works confusingly. The
portable habit is to always pass a suffix (-i.bak, no space, works on both)
and delete the backups after review. If you genuinely need no backup on macOS, it is
sed -i '' 's/a/b/' f with an explicit empty string. Every engineer hits
this once; now you have hit it here instead of in a deploy script.
Assuming the default field split when the data is delimited. awk's
whitespace splitting is right for command output and wrong for anything with a real
delimiter. /etc/passwd without -F: gives you one field per line, or a few
wherever a full-name entry contains spaces. Worse is the half-failure: a tab-separated file where most values contain no
spaces mostly works, until one value does and your column numbering shifts for just
those rows. If the data has a delimiter, say so with -F, every time.
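The half-failure takes two rows to reproduce. This sketch uses invented tab-separated data (host, region, status) where the second host name contains a space; tsv_demo is a hypothetical name.

```shell
# Two invented tab-separated rows; the second has a space inside its first value.
tsv_demo=$(mktemp)
printf 'web01\teu-west\t200\n' > "$tsv_demo"
printf 'db primary\teu-west\t500\n' >> "$tsv_demo"

# Default splitting: column 3 is the status on row 1 but the region on row 2.
awk '{ print $3 }' "$tsv_demo"
# 200
# eu-west

# Naming the delimiter keeps the columns where they belong.
awk -F'\t' '{ print $3 }' "$tsv_demo"
# 200
# 500
```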
Locale surprises. Regex ranges and sort order both consult the locale.
In some locales [a-z] can match uppercase letters too, and
sort orders case-insensitively, so a pipeline that worked on your laptop
misbehaves on a box with a different LANG. For predictable byte-wise
behaviour in scripts, prefix the command with LC_ALL=C — it is also
noticeably faster on large files, since the collation logic gets out of the way.
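A small sketch of what pinning the locale buys you, on three invented words. Only the LC_ALL=C results are shown, because the unpinned results depend on the machine's locale settings.

```shell
# Byte-wise ordering under the C locale: uppercase letters sort before lowercase.
printf '%s\n' apple Banana cherry | LC_ALL=C sort
# Banana
# apple
# cherry

# Under the C locale, [a-z] means exactly the 26 lowercase ASCII letters.
printf '%s\n' apple Banana | LC_ALL=C awk '/^[a-z]/'
# apple
```

Run the same two commands without the LC_ALL=C prefix and compare: if the output differs on your machine, you have found the difference before a pipeline did.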
Greedy matching grabs more than you meant. Both tools use regexes where
.* takes the longest possible match, and there is no
.*? non-greedy escape hatch as in Python or Perl.
s/<.*>// on a line with two tags deletes from the first
< to the last >, taking the text between the
tags with it. The idiomatic fix is a negated class that cannot run past the closer:
s/<[^>]*>//g. When a substitution removes more than expected,
greed is the first suspect.
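Both substitutions can be run against the same line to see the difference. A minimal sketch, with an invented line of markup:

```shell
# An invented line with two pairs of tags.
line='<b>bold</b> and <i>italic</i>'

# Greedy: .* runs from the first < to the last >, so the whole line is consumed
# and sed prints an empty line.
echo "$line" | sed 's/<.*>//'

# Negated class: each match stops at its own closing >, so only the tags go.
echo "$line" | sed 's/<[^>]*>//g'
# bold and italic
```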
Editing the file you are reading from. awk '...' data.txt > data.txt destroys the file: the shell truncates
data.txt for the redirect before awk reads a byte, so awk opens an
empty file and the original is gone. Write to a temp file and move it over, or use
sed -i, which does that dance internally (a detail with a side effect worth
knowing: the file is replaced, so its inode changes and hard links break). GNU awk
offers -i inplace, but the temp-file-and-mv habit is the one
that works everywhere.
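The temp-file-and-mv habit can be sketched as follows. The sample file and the variable names data and tmp are invented, and the dedupe one-liner stands in for whatever awk program you are actually running.

```shell
# Invented sample file with a duplicate line.
data=$(mktemp)
printf '%s\n' alpha beta alpha > "$data"

# Write to a temp file first; replace the original only if awk succeeded.
tmp=$(mktemp)
awk '!seen[$0]++' "$data" > "$tmp" && mv "$tmp" "$data"

cat "$data"
# alpha
# beta
```

The && matters: if awk fails, the original is left untouched and the temp file is the only casualty. As with sed -i, the result is a new file under the old name, so it carries the temp file's inode and permissions.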
A drill you can run right now
Everything below happens in /tmp on three lines of fake data, so it is safe
on any machine including a shared one. Ten minutes, and the histogram, the sum, and the
substitute-with-backup move stop being things you read about today.
Step 1 — build the log. Three echoes, three lines in the access-log shape used throughout this page:
$ echo '10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 5120' > /tmp/drill.log
$ echo '10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /api/orders HTTP/1.1" 500 312' >> /tmp/drill.log
$ echo '10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /api/orders HTTP/1.1" 200 2048' >> /tmp/drill.log
Before running anything, predict the field numbers. Count the whitespace-separated
pieces of the first line and convince yourself the status code is $9 and
the byte count is $10. (The quoted request is three fields to awk,
because awk knows nothing about quotes — another small honest limit.)
Step 2 — the histogram. Tally the status codes and check the output against what you can see by eye in three lines:
$ awk '{ c[$9]++ } END { for (k in c) print k, c[k] }' /tmp/drill.log
200 2
500 1
Step 3 — the sum. Total the response bytes, and verify the arithmetic yourself: 5120 + 312 + 2048.
$ awk '{ s += $10 } END { print s }' /tmp/drill.log
7480
Step 4 — the substitution, with a receipt. Rewrite the API prefix in place, keep a backup, and diff the two to see exactly what changed:
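$ sed -i.bak 's|/api/|/v2/api/|' /tmp/drill.log
$ diff /tmp/drill.log.bak /tmp/drill.log
1,3c1,3
< 10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 5120
< 10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /api/orders HTTP/1.1" 500 312
< 10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /api/orders HTTP/1.1" 200 2048
---
> 10.0.4.12 - - [08/Jun/2026:10:00:01 +0000] "GET /v2/api/users HTTP/1.1" 200 5120
> 10.0.9.55 - - [08/Jun/2026:10:00:02 +0000] "GET /v2/api/orders HTTP/1.1" 500 312
> 10.0.4.12 - - [08/Jun/2026:10:00:03 +0000] "POST /v2/api/orders HTTP/1.1" 200 2048
$ rm /tmp/drill.log /tmp/drill.log.bak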
Walk back through what you used: the histogram and the sum were both
accumulate, then report in END; the substitution was address (every line),
command (s), with an alternate delimiter because the pattern was full of slashes;
and the .bak plus diff pair is the same review-before-trust
habit that scales up to the forty-file config surgery. Three lines of fake log, but the
moves are identical at three hundred thousand.
awk '{ c[$KEY]++ } END { for (k in c) print k, c[k] }' —
half of all log questions are this line with a different field number. For sed, the safe
edit: sed -i.bak 's/old/new/g' file, then read the diff before you delete
the backup.
Further reading
- The GNU awk manual — the reference; the "Getting Started" and "Arrays" chapters cover everything this page used and a little more.
- The GNU sed manual — short by manual standards; the sections on addresses and the s command repay a careful read once.
- The AWK Programming Language — Aho, Kernighan, Weinberger — the book by the A, W, and K themselves; still the best argument that small languages age well.
- Eric Pement — handy one-line scripts for sed — the famous collection; read it after you know the model and each entry decodes itself.
- Semicolony — the regex cheat sheet — the dialect details (basic vs extended, escaping rules) both tools depend on.