find & xargs

The disk filled overnight and you need every file bigger than 100 MB written in the last day. A build tree has eighteen thousand stale artifacts and you need them gone, but only the ones older than a month, and you would very much like to not delete the wrong ones. These are the same shape of problem: which files match, and how do I act on all of them safely? find answers the first half — it selects. xargs (or find's own -exec) answers the second — it acts. This page covers the predicates worth memorising, decodes the -mtime arithmetic everyone gets backwards, walks three production scenarios, explains why xargs exists at all, and ends with a drill in /tmp that cannot hurt anything.

The question they answer

find walks a directory tree and tests every entry it meets against a list of predicates: name patterns, file types, ages, sizes, owners, permissions. Whatever passes the tests gets printed — or handed to an action. That is the entire tool. The grammar takes an evening to stop fighting you, but the model is one sentence: walk everything under this path, keep what matches, do something with the survivors.

The "do something" half is where the second tool comes in. find can run a command itself with -exec, and for many jobs that is the right call. But when the file list is huge, when you want the work parallelised, or when the matching and the acting belong in separate steps so you can inspect the list in between, you pipe the paths to xargs. Its job sounds almost too small to need a program: read items from standard input, pack them onto the end of a command line, and run that command — as many times as it takes to consume the input. The reason that job needs a program at all is a kernel limit called ARG_MAX, and we will get to it, because it explains a class of "Argument list too long" failures that otherwise look like the shell is broken.

Keeping the two roles separate is the discipline this page keeps returning to. find selects; xargs or -exec acts. The selection step is read-only and you can run it as many times as you like, staring at the list until it is exactly the set of files you mean. Only then do you bolt an action onto it. Every horror story about find ... -delete taking out a production directory is a story about someone skipping the staring step.

The predicates and flags that matter

Like most tools from this era, find has a man page you could read for a week. The working set is much smaller. These are the predicates and flags that cover nearly all of the daily and on-call work.

Predicate / flag	What it does	The thing to remember
`-name '*.log'`	Matches the file's basename against a glob; `-iname` is the case-insensitive twin	Quote the glob, always, or the shell eats it before find sees it
`-path '*/cache/*'`	Matches the whole path, not just the basename	For skipping trees, pair it with `-prune` rather than filtering after the walk
`-type f` / `-type d`	Regular files only, or directories only	Without it you match directories too, and actions like `rm` start failing mid-run
`-mtime -1` / `+7` / `7`	Modified time, measured in whole 24-hour periods	The +/−/exact arithmetic is the part everyone gets backwards — decoder below
`-mmin -30`	Same idea in minutes	For "what changed in the last half hour" during an incident
`-size +100M`	Files larger than 100 MiB	Units matter: `+100M` counts in 1 MiB (1,048,576-byte) units, bare `+100` is 512-byte blocks
`-exec cmd {} \;`	Runs cmd once per matching file	One fork+exec per file — fine for ten files, painful for ten thousand
`-exec cmd {} +`	Runs cmd with as many files as fit on one command line	Batched like xargs, but `{}` must sit at the end — no flags after it
`-print0` … `xargs -0`	Separates paths with NUL bytes instead of newlines	The contract that makes filenames with spaces survive the pipe; the two halves travel together
`xargs -P 8 -n 16`	Up to 8 commands in parallel, 16 items each	Free parallelism for anything per-file: grep, gzip, image conversion

A few notes before the decoder. The difference between -name and -path trips people because the glob looks the same: -name '*.conf' tests only the last path component, so -name 'etc/*.conf' silently matches nothing, ever — basenames do not contain slashes. When you want to match on where a file lives rather than what it is called, that is -path. And when what you actually want is to not descend into a directory at all (a node_modules, a .git), filtering the output is the slow way; -path '*/node_modules' -prune -o ... -print tells the walk to skip the subtree entirely, which on a big tree is the difference between seconds and minutes.
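A minimal sketch of the skip-the-subtree form, run against a throwaway tree. The directory and file names are invented for illustration, and mktemp -d is assumed to be available.

```shell
# Build a scratch tree: one real source file, one file buried in node_modules.
tree=$(mktemp -d)
mkdir -p "$tree/src" "$tree/node_modules/pkg"
touch "$tree/src/app.js" "$tree/node_modules/pkg/index.js"

# -prune stops the walk at node_modules; -o hands every other entry to the
# right-hand side, where the explicit -print is the only action.
pruned=$(find "$tree" -path '*/node_modules' -prune -o -type f -name '*.js' -print)
echo "$pruned"

# For comparison: the same walk with no -prune sees both files.
unpruned_count=$(find "$tree" -type f -name '*.js' | wc -l)
echo "without -prune: $unpruned_count files"

rm -rf "$tree"
```

The explicit -print matters here: without it, find applies its default print to the whole expression and the pruned directory name itself shows up in the output.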

The two -exec terminators look like punctuation trivia and are anything but. \; means "run the command now, with this one file" — one process per match. + means "collect matches and run the command with a big batch of them," the same batching xargs does, implemented inside find. For commands that accept many file arguments (most of them: rm, grep, chmod, gzip), the + form is dramatically cheaper, and the timing block in the next section puts a number on it. The one structural limit of + is that {} has to be the last thing before it. -exec mv {} /backup + is a syntax error, because find has nowhere to splice the batch. GNU mv's -t /backup flag works around it by moving the destination to the front, or you hand the job to xargs, which has no such restriction on where its own arguments go.

The -mtime decoder. find measures age in whole 24-hour periods, rounding down. A file modified 7 days and 9 hours ago has an integer age of 7. -mtime 7 matches files whose integer age is exactly 7 — modified between 7 and 8 days ago. -mtime +7 means strictly greater than 7, which is age 8 and up: older than 8 full days, not 7. -mtime -7 means strictly less: younger than 7 full days. The surprise hiding in there: -mtime +7 and -mtime -7 together do not cover everything — the files in the exact-7 bucket match neither.

Age runs right to left, in whole 24-hour buckets. -7 is everything younger than the 7-day mark, +7 is everything older than the 8-day mark, and bare 7 is the one-day bucket in between that both of the others skip.
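The bucket arithmetic can be checked by hand. The sketch below assumes GNU touch and GNU stat (for -d and -c %Y) and uses a scratch file whose age is set to 180 hours, which is 7 days 12 hours.

```shell
# Scratch file with a known age: 180 hours = 7 days 12 hours.
f=$(mktemp)
touch -d '180 hours ago' "$f"

now=$(date +%s)
mtime=$(stat -c %Y "$f")        # GNU stat: mtime as seconds since the epoch

# Shell integer division truncates, which is the same rounding find applies.
age_days=$(( (now - mtime) / 86400 ))
echo "integer age: $age_days"

# The three predicates, tested against that one file.
for pred in +7 7 -7; do
    if [ -n "$(find "$f" -mtime "$pred")" ]; then
        echo "-mtime $pred: match"
    else
        echo "-mtime $pred: no match"
    fi
done
```

Only the bare 7 should report a match: the file is older than seven days by the clock, but its integer age is exactly 7.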

Reading the output

Three short sessions, each making one of the abstract claims above concrete. First, the -mtime arithmetic against real dates. Imagine it is two in the afternoon on June 8th and an app directory holds a week and a half of rotated logs:

$ date
Mon Jun  8 14:00:00 UTC 2026
$ find /var/log/app -name '*.log*' -mtime +7
/var/log/app/server.log.10   ← mtime May 29, age 10 days: 10 > 7, matches
/var/log/app/server.log.9    ← mtime May 31 03:00, age 8 days 11h → integer age 8, matches
$ find /var/log/app -name '*.log*' -mtime 7
/var/log/app/server.log.8    ← age 7 days 16h → floor is 7: the exact bucket, missed by both +7 and -7
$ find /var/log/app -name '*.log*' -mtime -7
/var/log/app/server.log      ← written 2 hours ago
/var/log/app/server.log.1    ← …through .7: everything younger than 168 hours

Notice server.log.8. It is seven days and sixteen hours old, "older than a week" by any human reading, and -mtime +7 does not return it, because its integer age is 7, not greater than 7. If your retention policy says "delete after seven days," the predicate that implements it is -mtime +6, and an uncomfortable number of cron jobs in the wild are quietly keeping one extra day of files because someone wrote +7.

Second, the cost of -exec ... \; versus -exec ... +. The work here is trivial (touching files); all the time in the first run goes into process creation:

$ find build/tmp -name '*.o' | wc -l
18422
$ time find build/tmp -name '*.o' -exec touch {} \;
real    0m38.114s    ← 18,422 fork+exec cycles — one process per file
$ time find build/tmp -name '*.o' -exec touch {} +
real    0m0.681s     ← same files, 2 processes — each argv packed near the limit

Fifty-odd times faster, for one character of difference. The lesson generalises: any time a command can take many files at once, batch. The \; form still earns its keep when the command genuinely needs one file at a time — say, a script that takes exactly one argument — or when you want a placeholder somewhere other than the end, which + forbids.
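If you would rather count processes than time them, the sketch below makes each invocation print one line, so wc -l reports how many times the command was started. The fifty scratch files and the inner sh -c wrapper are illustrative only.

```shell
tree=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    touch "$tree/obj$i.o"
    i=$((i + 1))
done

# Each run of the inner sh prints one line, so the line count is the process count.
per_file=$(find "$tree" -name '*.o' -exec sh -c 'echo run' sh {} \; | wc -l)
batched=$(find "$tree" -name '*.o' -exec sh -c 'echo run' sh {} + | wc -l)

echo "one-per-file form: $per_file invocations"
echo "batched form:      $batched invocation(s)"

rm -rf "$tree"
```

With fifty short paths the whole batch fits on one command line, so the batched form starts the command once against fifty starts for the per-file form.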

Third, the -print0 contract, shown at the byte level. By default find separates paths with newlines, and by default xargs splits its input on any whitespace — so a file called release notes.cfg arrives at the command as two arguments, release and notes.cfg, both wrong. NUL bytes fix it, because NUL is the one byte a path can never contain. Run the find through od -c and you can see the separators:

$ find . -name '*.cfg' -print0 | od -c
0000000   .   /   r   e   l   e   a   s   e  _   n   o   t   e   s   .
0000020   c   f   g  \0   .   /   d   b   .   c   f   g  \0
0000035
  the space inside the filename (shown as _) is just another byte now;
  only the \0 bytes are separators, and xargs -0 splits only on those
$ find . -name '*.cfg' -print0 | xargs -0 ls -l
-rw-r--r-- 1 deploy deploy  412 Jun  8 09:12 ./db.cfg
-rw-r--r-- 1 deploy deploy 1187 Jun  7 18:40 ./release notes.cfg

The two halves are a pair: -print0 on the producing side, -0 on the consuming side. One without the other is worse than neither: NUL-separated output fed to a plain xargs contains none of the separators it is looking for, so the paths are not split where you intended. Make the pairing a reflex and filenames with spaces, quotes, and even embedded newlines (which are legal, regrettably) stop being able to hurt you.

The select-then-act pipeline. find emits a NUL-separated stream of survivors; xargs packs them into as few argv loads as the ARG_MAX budget allows, and each load becomes one process.

Three production scenarios

What filled the disk since yesterday

The volume alert fires at 91% and the graph shows the climb started last night. You want one list: every large file written in the last day, with sizes, so you can see the culprit instead of guessing at it.

$ sudo find / -xdev -type f -mtime -1 -size +100M -exec ls -lh {} +
-rw-r--r-- 1 deploy deploy  18G Jun  8 03:12 /var/log/app/debug.log
-rw-r--r-- 1 root   root   1.2G Jun  8 02:40 /var/tmp/core.41327
-rw-r--r-- 1 deploy deploy 640M Jun  8 06:55 /srv/uploads/batch-export.csv

Reading the selection left to right: -xdev stops the walk from crossing onto other mounted filesystems, so a search of / does not wander into /proc, a network mount, or a second disk that is not the one alerting. -type f keeps directories out of the list. -mtime -1 is the decoder in action: modified within the last 24 hours. -size +100M cuts the noise — when a disk fills overnight, the cause is nearly always a handful of big files, not a million tiny ones. The -exec ls -lh {} + tail turns a bare path list into something with sizes and timestamps you can reason about. An 18 GB debug.log written at 3am answers the question by itself. If the numbers do not add up — df says full but no file accounts for it — the disk is hiding space somewhere find cannot see, and that investigation is the subject of why is the disk full?
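When the list is long, sorting by size puts the culprit on the first line. The sketch below is a variation rather than the command above: it assumes GNU find's -printf, and it runs against a scratch directory with scaled-down sparse files (made with truncate) so it finishes instantly.

```shell
# Scratch tree standing in for the real filesystem; sizes are scaled down.
root=$(mktemp -d)
truncate -s 3M  "$root/debug.log"
truncate -s 2M  "$root/core.41327"
truncate -s 10K "$root/small.txt"

# %s is the size in bytes and %p the path (GNU find's -printf).
# sort -rn puts the largest first; head caps the list.
biggest=$(find "$root" -xdev -type f -mtime -1 -size +1M -printf '%s\t%p\n' \
    | sort -rn | head -n 20)
printf '%s\n' "$biggest"

rm -rf "$root"
```

On a real incident the starting point would be / and the threshold +100M, exactly as in the session above.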

Cleaning old artifacts without an incident

A build cache has been accreting tarballs for a year and it is time to drop everything older than thirty days. This is the scenario where the select-then-act seam earns its keep, because the action is destructive and find will do exactly what you typed, not what you meant. The discipline: run the selection with -print first, read the list, count it, and only then swap the print for the delete. (By the decoder above, -mtime +30 selects integer ages of 31 and up, so the commands below keep one day more than a strict thirty-day cutoff; use +29 if the policy is exact.)

$ find /var/cache/builds -type f -name '*.tar.gz' -mtime +30 -print | head
/var/cache/builds/app-2026-04-11.tar.gz
/var/cache/builds/app-2026-04-12.tar.gz
…
$ find /var/cache/builds -type f -name '*.tar.gz' -mtime +30 -print | wc -l
212                          ← does 212 match what you expected? if not, stop here
$ find /var/cache/builds -type f -name '*.tar.gz' -mtime +30 -delete

The dry run costs ten seconds and removes the entire class of "I deleted the wrong things" failures, because the final command is the same selection with one word changed. Resist the urge to retype the command from scratch for the delete; edit the line you already verified. Two properties of -delete are worth knowing before you use it in anger: it implicitly turns on -depth, so a directory's contents are processed before the directory itself (children before parents, which lets directories empty out before find reaches them), and it refuses to remove non-empty directories; for files it behaves like rm, for directories like rmdir. It is also positional, which is a pitfall big enough to get its own entry below.
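In a script, the same habit can be enforced by writing the selection once and letting only the action vary. The helper below is hypothetical (prune_builds is not a standard tool), a sketch assuming GNU touch for the scratch data.

```shell
# Hypothetical helper: prune_builds DIR DAYS [--delete]
# Dry run by default; the selection is written exactly once.
prune_builds() {
    dir=$1
    days=$2
    case ${3:-} in
        '')       action=-print ;;
        --delete) action=-delete ;;
        *)        echo "usage: prune_builds DIR DAYS [--delete]" >&2; return 2 ;;
    esac
    # The action is the last word, after every test.
    find "$dir" -type f -name '*.tar.gz' -mtime "+$days" "$action"
}

# Scratch cache: one stale tarball, one fresh.
cache=$(mktemp -d)
touch -d '45 days ago' "$cache/app-old.tar.gz"
touch "$cache/app-new.tar.gz"

dry_run=$(prune_builds "$cache" 30)     # lists what would go
printf '%s\n' "$dry_run"
prune_builds "$cache" 30 --delete       # same selection, now destructive
ls "$cache"
```

Because the dry run and the delete share one find line, there is no second copy of the selection to drift out of sync.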

Grepping a huge tree, fast

You need every file mentioning a deprecated config key across a repository with a hundred thousand files. A bare recursive grep works, but it runs on one core while the other fifteen watch. The find side selects just the files worth searching; the xargs side fans the work out:

$ find . -path './vendor' -prune -o -type f -name '*.go' -print0 \
    | xargs -0 -P 8 grep -l 'legacy_timeout'
./internal/api/server.go
./internal/api/middleware.go
./cmd/worker/config.go

-prune keeps the walk out of the vendored tree entirely, -print0 | xargs -0 is the spaces contract, and -P 8 runs eight greps at once, each chewing through its own batch. One small trap appears the moment you drop -l to see the matching lines themselves: grep only prefixes matches with filenames when it is given more than one file, and xargs makes no promise about batch sizes, so the last batch can be a single file. Add -H to grep (or /dev/null as an extra argument) and the output format stays stable. With -l, as written here, the output is always bare filenames and the trap does not arise. The honest footnote is that this entire pipeline is what ripgrep does in one command, with the parallelism, the pruning, and the gitignore handling built in. On your laptop, use rg. The find/xargs version is for the machines that matter at the worst times (minimal containers, stripped-down AMIs, someone else's production box) where the pre-installed tools are the only tools, and it is also the general form: swap grep for gzip, chmod, or an image converter and the same pipeline parallelises that instead.
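A small sketch of the single-file-batch case, using a scratch repository with invented file names. The -n 1 is there only to force the worst case of one file per grep; -r and -P are assumed to be supported by the local xargs, as they are in GNU xargs.

```shell
repo=$(mktemp -d)
mkdir -p "$repo/internal" "$repo/vendor/lib"
printf 'timeout := legacy_timeout\n' > "$repo/internal/server.go"
printf 'x := legacy_timeout\n'       > "$repo/vendor/lib/dep.go"

# -n 1 hands grep exactly one file per run. Without -H grep would print only
# the matching line; with -H the "path:line" format holds however xargs
# slices the input.
hits=$(find "$repo" -path "$repo/vendor" -prune -o -type f -name '*.go' -print0 \
    | xargs -0 -r -n 1 -P 4 grep -H 'legacy_timeout')
printf '%s\n' "$hits"

rm -rf "$repo"
```

The vendored copy never appears in the output because the walk never descends into it.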

What happens underneath

A find over a big tree is a syscall story. For every directory, find calls readdir to list the entries; for every entry, it may also need stat to fill in metadata. Which predicates you use decides how expensive the walk is. A pure -name test needs only the names readdir already returned, and even -type can usually be answered from the directory entry itself on common filesystems. But -mtime, -size, -user, and -perm live in the inode, so every candidate file costs a stat call. On a tree with five million files, that is five million inode lookups, and the inodes are scattered across the disk in whatever order the filesystem allocated them. How directories, inodes, and that on-disk layout fit together is the subject of file systems, and you can watch a walk touch inodes one by one in the filesystem simulator.

On a local SSD this mostly stays tolerable. On NFS it becomes the famous stat storm: each stat of an uncached file is a network round trip, so a find that needs metadata for a million remote files issues a million sequential requests, saturating nothing and taking hours while also making the file server miserable for everyone else. The mitigation is to ask for less: prune aggressively, put the cheap name tests before the expensive metadata tests so fewer files survive to be stat-ed, and prefer -xdev so a search of the local disk never wanders onto the mount in the first place.

Then there is the limit that justifies xargs's existence. When a process runs another program, the arguments and environment are copied into the new process's memory by execve, and the kernel caps the total size — getconf ARG_MAX reports it, commonly a couple of megabytes on Linux. This is why rm *.tmp in a directory with a million files fails with Argument list too long: the shell expanded the glob into one argv that does not fit. The shell is not being fragile; the kernel rejected the exec. xargs is the workaround promoted to a tool — it reads an unbounded stream and slices it into argv loads that each fit under the cap, running the command once per slice. find's -exec ... + packs batches against the same budget. Once you know the limit exists, both tools stop looking like conveniences and start looking like the only correct way to apply one command to an arbitrarily long file list.
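The slicing is easy to observe. In the sketch below, seq produces a few megabytes of arguments, more than one exec can carry on a typical Linux box, and each run of echo prints one line, so wc -l counts the batches. The exact count depends on the xargs implementation and its buffer size, so treat the number as illustrative.

```shell
# The kernel's cap on argv plus environment for one exec, in bytes.
getconf ARG_MAX

# Half a million numbers will not fit in one argv, so xargs runs echo
# several times. One output line per run means wc -l counts the batches.
batches=$(seq 1 500000 | xargs echo | wc -l)
echo "echo ran $batches times"
```

On GNU xargs, xargs --show-limits < /dev/null prints the buffer size it actually uses, which may be smaller than ARG_MAX.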

Pitfalls

Unquoted -name globs. find . -name *.log hands the glob to the shell before find ever runs. If the current directory has exactly one .log file, the shell expands the pattern to that one name and find silently searches the whole tree for files called, say, server.log — matching almost nothing, with no error to tip you off. With several matching files you get the baffling paths must precede expression instead, which at least fails loudly. Quote every pattern: -name '*.log'. No exceptions, even when it works without.
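The silent case can be reproduced in a scratch directory holding exactly one .log file at the top level and another one level down. The file names are invented, and the unquoted form appears only to show the failure.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/server.log" "$demo/sub/worker.log"

# Unquoted: the shell rewrites *.log to server.log before find starts,
# so find is really asked for files named exactly server.log.
unquoted=$(cd "$demo" && find . -name *.log | wc -l)

# Quoted: find receives the pattern itself and applies it at every level.
quoted=$(cd "$demo" && find . -name '*.log' | wc -l)

echo "unquoted: $unquoted   quoted: $quoted"
rm -rf "$demo"
```

The unquoted run misses sub/worker.log and reports no error.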

-delete is positional. find's predicates form an expression evaluated left to right, and actions are just predicates that have side effects. find . -delete -name '*.tmp' deletes everything — the delete runs before the name test gets a chance to filter, on every file the walk visits. The name test then runs on whatever is left, accomplishing nothing. The rule: -delete goes last, after every test, and the print-before-delete habit catches this too, because find . -print -name '*.tmp' shows you the unfiltered list before anything is at risk.
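The ordering effect can be seen without risking anything by putting -print where -delete would go. A sketch on a scratch directory with invented file names:

```shell
demo=$(mktemp -d)
touch "$demo/a.tmp" "$demo/b.tmp" "$demo/keep.txt"

# Action first: -print fires on every entry the walk visits (the directory
# itself included); the -name test afterwards filters nothing that matters.
action_first=$(find "$demo" -print -name '*.tmp' | wc -l)

# Tests first: only entries that pass -name ever reach the action.
tests_first=$(find "$demo" -name '*.tmp' -print | wc -l)

echo "action first: $action_first entries   tests first: $tests_first entries"
rm -rf "$demo"
```

Four entries against two: with -delete in place of -print, the first form would have removed keep.txt as well.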

xargs without -0. Plain xargs splits on spaces, tabs, and newlines, and on top of that it interprets quote characters and backslashes as syntax — a filename containing don't produces unmatched single quote and a dead pipeline, and a filename with a space becomes two arguments pointing at files that do not exist. For rm, "a file that does not exist" can quietly become "a different file that does." Filenames are attacker-controlled input on any box where users upload things. The contract is mechanical: find says -print0, xargs says -0, every time the input is filenames.

-exec + with arguments after the braces. -exec mv {} /backup/ + does not run; find requires {} immediately before the +, because the batch is spliced onto the end. People hit this, shrug, and fall back to \; — paying a process per file for no reason. The fixes: GNU mv and cp accept -t DEST, which moves the destination ahead of the file list (-exec mv -t /backup/ {} +), and xargs never had the restriction.
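A sketch of the -t form on scratch directories, assuming GNU mv. The file names are invented, and one includes a space to show that -exec passes paths through intact.

```shell
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/one.tar" "$src/two.tar" "$src/release notes.tar"

# The destination goes first via -t, so {} can be the last word before +.
find "$src" -type f -name '*.tar' -exec mv -t "$dest" {} +

moved=$(ls "$dest" | wc -l)
left=$(ls "$src" | wc -l)
echo "moved: $moved   left behind: $left"
```

The xargs equivalent is find "$src" -type f -name '*.tar' -print0 | xargs -0 -r mv -t "$dest".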

xargs running once on empty input. If the find matches nothing, GNU xargs by default still runs the command once with no file arguments, and some commands do something surprising with an empty file list: ls -l lists the current directory instead of nothing, and rm exits with a missing-operand error that can fail a script. GNU xargs offers -r (--no-run-if-empty) to skip the run instead. In scripts, add it by default.
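The behaviour is visible with an empty pipe. The sketch below describes GNU xargs; other implementations may already skip the run without -r.

```shell
# Empty input, no -r: GNU xargs still starts the command once, with no arguments.
without_r=$(printf '' | xargs echo ran | wc -l)

# Empty input with -r: the command is never started.
with_r=$(printf '' | xargs -r echo ran | wc -l)

echo "without -r: $without_r run(s)   with -r: $with_r run(s)"
```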

A drill you can run right now

Everything below happens inside a directory you create in /tmp, so there is nothing to break. Ten minutes, and the three habits this page argues for — decoding -mtime before trusting it, printing before deleting, and pairing -print0 with -0 — become things your hands have done.

Step 1 — build a tree with known ages. touch -d lets you set modification times in the past, which means you can test age predicates against files whose ages you know exactly:

$ mkdir -p /tmp/find-drill/deep && cd /tmp/find-drill
$ touch fresh.log deep/nested.log "release notes.txt"
$ touch -d '3 days ago' three.log
$ touch -d '180 hours ago' boundary.log    ← 7.5 days, given as a single unit so there is no doubt what "ago" applies to
$ touch -d '20 days ago' ancient.log

Before running anything, predict: which files will -mtime +7 return? Write your answer down, then check it against find . -name '*.log' -mtime +7. If you predicted boundary.log would match, reread the decoder — it is 7.5 days old, its integer age is 7, and +7 demands strictly greater. Try -mtime 7 and -mtime -7 too, and watch the three predicates partition your five log files with no overlaps. That partition is the number line diagram, run on files you made.

Step 2 — the print-before-delete habit. Pretend the drill directory is a build cache and the policy is "logs older than five days go." Selection first:

$ find . -name '*.log' -mtime +5 -print
./boundary.log
./ancient.log
$ find . -name '*.log' -mtime +5 -delete   ← same line, one word swapped, only after the list looked right
$ ls *.log
fresh.log  three.log

The habit to install is the editing motion itself: up-arrow, change -print to -delete, enter. Never retype the selection for the destructive pass — the verified line and the executed line must be the same line.

Step 3 — watch a space break the pipe, then fix it. You created release notes.txt in step 1 for exactly this moment:

$ find . -name '*.txt' | xargs ls -l
ls: cannot access './release': No such file or directory
ls: cannot access 'notes.txt': No such file or directory
$ find . -name '*.txt' -print0 | xargs -0 ls -l
-rw-r--r-- 1 nilesh nilesh 0 Jun  8 14:20 './release notes.txt'

One filename, two phantom arguments, and the broken version even exits nonzero, so in a script it would have failed loudly here — but remember the rm case, where the phantom names might exist and the failure is silent and destructive.

Step 4 — feel the parallelism. sleep makes the effect of -P visible with no files involved at all. Four sleeps of three seconds, serial and then four-way parallel:

$ time printf '3\n3\n3\n3\n' | xargs -n 1 sleep
real    0m12.041s    ← four runs, one after another
$ time printf '3\n3\n3\n3\n' | xargs -P 4 -n 1 sleep
real    0m3.019s     ← four runs at once — same work, a quarter of the wall clock

When you are done, clean up with the same discipline you just practised: find /tmp/find-drill -depth -print to see what would go, then swap -print for -delete. (The -depth makes the walk remove children before parents, which is what lets -delete take out the directories too — it is the ordering -delete would have switched on for itself.)

If you remember one line. find PATH tests -print first, stare at the list, then swap -print for the action. And whenever a pipe carries filenames, it is -print0 on the left and -0 on the right, no exceptions. The single-command crib for the predicates is on the shell cheat sheet.
