df, du & ncdu

The pager says the disk is at 95% and climbing. The first command you type decides how the next twenty minutes go. df asks the filesystem how many blocks are allocated; du walks the directory tree and adds up what it can reach by name. Those are different questions, and the gap between their answers is where every interesting disk incident lives. This page covers the five invocations worth knowing, reads the output column by column, walks three production scenarios, explains why the numbers diverge, and ends with a drill that makes a directory's contents vanish and reappear without deleting a thing.

One question, two tools

"Where did the disk space go?" sounds like one question, but Linux gives you two ways to ask it, and they are not interchangeable. df — disk free — asks the filesystem. Every mounted filesystem keeps running counters of how many blocks it has, how many are in use, and how many inodes remain. df reads those counters and prints them. It is the accountant's view: fast, exact, and completely indifferent to what the blocks are being used for.

du — disk usage — asks the directory tree. It starts at a path you give it, walks every file and subdirectory it can reach by name, asks each one how many blocks it occupies, and sums as it goes. It is the surveyor's view: slower, granular, and limited to what has a name. A file that exists on disk but has no path the walk can reach simply does not appear in du's total.

Most days the two agree to within a rounding error and you never notice the distinction. The days you get paged are the days they disagree, and the disagreement is never a bug — it is the two tools faithfully reporting two different truths. Blocks held by a deleted-but-open file are real to df and invisible to du. Files buried under a mount point are counted by df and unreachable by du. A filesystem can refuse writes with plenty of blocks free, because the gauge that ran out was inodes, which only df -i shows. Reading the gap is the skill this page teaches.

The third tool, ncdu, is du with a user interface: it does the same tree walk once, holds the result in memory, and lets you drill into the biggest directory with the arrow keys instead of re-running du at each level. For "a directory got huge and I need to find which one," it turns a ten-command session into one. It inherits every one of du's blind spots, though, because it sees the world the same way: by name.

The five invocations that matter

Both man pages are long; the working set is small. These five cover nearly every disk investigation, and the order below is roughly the order you run them during one.

Invocation	What it reports	When you reach for it
`df -h`	Per-filesystem block usage, human-readable sizes	First. Which filesystem is full, and how full, in two seconds
`df -i`	Per-filesystem inode usage	"No space left on device" while `df -h` shows free space
`du -xh --max-depth=1 /path`	Size of each immediate subdirectory, staying on one filesystem	Narrowing down which directory holds the bulk, level by level
`ncdu -x /path`	The same walk, interactive: sort, drill in, drill out	Triage. Finding the one runaway directory in minutes
`du -sh --apparent-size f`	Logical file length rather than allocated blocks	Sparse files, compressed filesystems, "why is this 10G file using 80K"

Two of those flags deserve a sentence each before you ever type them. The -x on du and ncdu means "stay on one filesystem": without it, a walk that starts at / happily descends into /proc, /sys, every NFS mount, and every container overlay it finds, and the total it prints answers a question you did not ask. You are almost always investigating one full filesystem, so you almost always want -x. Make it reflex the way -nP is reflex for lsof.

The --max-depth=1 keeps du from printing a line for every directory in the tree. du still visits everything — it has to, the totals roll up from the leaves — but it only prints the first level, which is what you read. The classic loop is: run it at /, see that /var is the heavy one, run it again at /var, and repeat until you hit the directory that explains the number. Pipe through sort -rh so the biggest entry is the first line you see. Or skip the loop entirely and let ncdu do the descending for you.

One-liner worth memorising. sudo du -xh --max-depth=1 / | sort -rh | head — one filesystem, one level, biggest first. During an incident this single line usually points at the guilty directory before df's output has scrolled off your screen.
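The descend-and-repeat loop described above can be scripted for machines where ncdu is not installed. This is a sketch under stated assumptions, not a standard tool: heaviest_path is a hypothetical helper name, it relies on GNU du and sort, and it follows only the single largest child at each level, so it will not notice two medium directories that together outweigh it.

```shell
# heaviest_path DIR [DEPTH]
# Hypothetical helper: follow the largest immediate subdirectory DEPTH
# times (default 3) and print where the descent ends up.
# Assumes GNU du; paths containing tabs or newlines are not handled.
heaviest_path() (
  dir=$1
  depth=${2:-3}
  while [ "$depth" -gt 0 ]; do
    # One level, one filesystem, sizes in KiB so sort -rn can order them.
    next=$(du -xk --max-depth=1 "$dir" 2>/dev/null |
      sort -rn |
      awk -F '\t' -v self="$dir" '$2 != self { print $2; exit }')
    [ -n "$next" ] || break   # no subdirectories left to descend into
    dir=$next
    depth=$((depth - 1))
  done
  printf '%s\n' "$dir"
)
```

Something like sudo sh -c with the function defined, or simply heaviest_path /var 4 as root, prints the path the manual loop would probably have reached after four rounds. Treat the result as a starting point and confirm it with the one-liner at that level.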

Reading the output

Here is df -h on a root filesystem that is about to ruin someone's evening. Six columns, and two of them hide arithmetic worth understanding.

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  916G  823G   47G  95% /

Filesystem is the block device or remote export backing the mount. Size is the filesystem's total capacity. Used is allocated blocks. Avail is what an ordinary process can still write. Mounted on is the path where this filesystem is attached. Now do the arithmetic the column headers invite you to do: 823 used plus 47 available is 870, but Size says 916. Forty-six gigabytes are missing from the table. That is not rounding. On ext4, by default, 5% of the blocks are reserved for root: ordinary processes cannot touch them, so they appear in neither Used nor Avail. The reservation exists so that when users fill the disk, root can still log in, daemons can still write logs, and the system stays repairable rather than wedging solid.

The reservation also explains a number that startles people: Use% is computed against the space available to non-root processes (Used plus Avail), not the raw size. It therefore reaches 100% while Size minus Used still shows a few gigabytes free; those are root's, not yours, and Avail reads zero. On some setups you may see it quoted above 100% when root has dipped into the reserve. When your service gets ENOSPC while a dashboard that divides Used by Size still reads 95%, this is usually why: 95% of the raw size plus the untouchable 5% is, for your process, completely full. The reservation is tunable per filesystem (tune2fs -m), and on huge data volumes that hold no system state, operators often shrink it: 5% of a 10 TB disk is half a terabyte of insurance you may not need.
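The arithmetic on that df line is small enough to check by hand, and a few lines of shell make it repeatable. The sketch below uses the numbers from the sample output; the function names are made up for illustration, and the round-up in use_pct reflects how GNU df computes the column, which other df implementations may not share.

```shell
# Arithmetic behind the df table. Inputs are whole numbers in one unit
# (GiB here). Function names are illustrative, not part of any tool.

# Space that appears in neither Used nor Avail: the root reservation.
reserved_gap() {
  echo $(( $1 - $2 - $3 ))          # size - used - avail
}

# Use% the way GNU df computes it: used / (used + avail), rounded up.
use_pct() {
  total=$(( $1 + $2 ))              # all the space non-root can ever use
  if [ "$total" -eq 0 ]; then
    echo "-"                        # nothing to divide by
    return
  fi
  echo $(( ($1 * 100 + total - 1) / total ))
}

reserved_gap 916 823 47   # prints 46
use_pct 823 47            # prints 95
```

Dividing 823 by the raw size of 916 would give about 90%, which is why a dashboard that computes its own percentage can disagree with df about how close to full the disk is.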

Now the surveyor's view of the same disk, biggest first, staying on one filesystem:

$ sudo du -xh --max-depth=1 / | sort -rh | head
823G	/
512G	/var
198G	/home
67G	/usr
21G	/opt
9.8G	/root
2.1G	/srv
640M	/etc
24K	/tmp
16K	/lost+found

The first line is the grand total for the walk; every line after it is one immediate child. Two readings happen at once here. First, the drill-down read: /var holds 512 of 823 gigabytes, so the next command is the same one with /var as the argument, and you keep descending until the number stops being interesting. Second, the reconciliation read: compare du's 823G total against df's 823G Used. Here they match, which tells you everything on this filesystem has a name and the investigation is a plain "which directory grew" hunt. When they don't match — when df reports tens of gigabytes more than du can find — stop descending, because the space you are hunting has no name, and no amount of du will surface it. The scenarios below cover both branches.
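The reconciliation read is a single subtraction, so it can be captured as a tiny check. This is a sketch, not an established tool: reconcile is a hypothetical name, and the 5% default tolerance is an arbitrary assumption to absorb metadata and rounding.

```shell
# reconcile USED DU_TOTAL [TOLERANCE_PCT]
# Compare df's Used with du's total for the same filesystem (same unit,
# whole numbers). The 5% default tolerance is an arbitrary assumption.
reconcile() (
  used=$1
  walked=$2
  tol=${3:-5}
  gap=$(( used - walked ))
  if [ $(( gap * 100 )) -le $(( used * tol )) ]; then
    echo "match: keep descending with du or ncdu"
  else
    echo "gap of $gap: look for space with no name"
  fi
)

# Feeding it live numbers for one mount point (GNU df and du assumed):
#   used=$(df -k --output=used /data | tail -n 1 | tr -d ' ')
#   walked=$(sudo du -sxk /data | cut -f1)
#   reconcile "$used" "$walked"
```

With the sample numbers, reconcile 823 823 reports a match. The path in the commented usage must be a mount point, otherwise df and du are describing different things and the comparison means nothing.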

ncdu presents the same walk as a screen you can move around in:

ncdu 1.19 ~ Use the arrow keys to move, press ? for help
--- / (one filesystem) -------------------------------------------
  512.3 GiB [##########] /var
  198.0 GiB [###       ] /home
   67.4 GiB [#         ] /usr
   21.2 GiB [          ] /opt
    9.8 GiB [          ] /root
    2.1 GiB [          ] /srv
 Total disk usage: 823.1 GiB   Items: 4,182,664

Enter descends into a directory, the left arrow backs out, n and s sort by name or size, and g cycles the usage column between graph, percentage, both, and neither. There is also a d key that deletes the selected file or directory, which is exactly as dangerous during a 3am incident as it sounds; start ncdu with -r for read-only mode when you are on a box that matters, and the delete key stops existing. The Items count at the bottom is quietly useful too: four million items is a normal root filesystem, while forty million tiny files is a hint that your problem might be the inode gauge, which brings us to the scenarios.

Three production scenarios

df says 98%, du finds nothing

The volume is at 98% and rising. You run the du one-liner and the total comes to barely half of what df reports. You run it again with sudo in case permissions hid something. Same answer. Forty gigabytes are allocated on this filesystem and not one of them has a name.

This is the deleted-but-open file, and it is the most common df-versus-du gap in production. A process opened a log file, something deleted the file — logrotate misconfigured to rm instead of signalling, or a human tidying up — and the process kept writing. Deleting a file removes its name. The inode and its blocks stay allocated until the last open descriptor closes. du walks names, so the file is gone from its world; df reads the filesystem's block counters, so the space is still very much there, and still growing.

$ sudo lsof -nP +L1 | grep -v ' /dev'
COMMAND   PID   USER   FD   TYPE DEVICE    SIZE/OFF NLINK   NODE NAME
java    41327 deploy   4w   REG  259,2 42949672960     0 524291 /var/log/app/server.log (deleted)

There are your forty gigabytes: link count zero, still open for writing. The recovery — and the trick of truncating the file through /proc/PID/fd without restarting anything — is covered in detail on the lsof page, and the full decision tree for a filling disk, of which this is one branch, lives in why is the disk full? The lesson for this page is the diagnostic shape: when df and du disagree by a lot, the disagreement itself is the finding. Do not keep running du in more directories. Switch tools.
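To see whether the nameless files account for the whole gap, total the SIZE/OFF column instead of reading it line by line. The filter below is a sketch that assumes the column layout in the header shown above; sum_deleted is a hypothetical name, and a COMMAND containing spaces would shift the fields.

```shell
# sum_deleted: read `lsof -nP +L1` output on stdin and total SIZE/OFF
# for regular files whose link count (NLINK, field 8) is 0.
# The same file open on several descriptors is counted once, keyed on
# DEVICE (field 6) and NODE (field 9).
sum_deleted() {
  awk '$5 == "REG" && $8 == 0 && $7 ~ /^[0-9]+$/ && !seen[$6 ":" $9]++ {
         total += $7
       }
       END { printf "%.0f\n", total }'
}

# Typical use:
#   sudo lsof -nP +L1 | sum_deleted
```

The result is in bytes. If it lands close to the difference between df's Used and du's total, the deleted-but-open files explain the gap; if it falls well short, something else is hiding space too.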

"No space left on device" with 60% free

A service starts failing every write with ENOSPC. You check df -h and the filesystem is at 40%. Restart the service; it fails again immediately. The error says the device is full and the block gauge says it is not even close. The gauge you have not checked is the other one:

$ df -h /data && df -i /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    1.8T  720G  1.1T  40% /data
Filesystem        Inodes    IUsed  IFree IUse% Mounted on
/dev/nvme1n1   117211136 117211136      0  100% /data

Every file needs an inode — the on-disk record that holds its metadata and points at its blocks — and on ext4 the inode table is sized once, at mkfs time. Run out of inodes and the filesystem cannot create a single new file, no matter how many free blocks remain. The usual culprit is millions of tiny files: a session store writing one file per login, a mail queue that stopped draining, a build cache exploding into node_modules confetti, a cron job that creates a temp file per run and never deletes any. Each file might be fifty bytes, which is why the block gauge barely moved while the inode gauge filled.
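Because the inode gauge fills independently of the block gauge, it is worth having a check that reads it. The filter below is a minimal sketch: full_inode_gauges is a hypothetical name, the default threshold of 90 is arbitrary, and mount points containing spaces are not handled.

```shell
# full_inode_gauges [THRESHOLD]
# Read `df -i` output on stdin and print mount points whose IUse% is at
# or above THRESHOLD (default 90, an arbitrary choice for this sketch).
# Filesystems that report "-" for IUse% are skipped.
full_inode_gauges() {
  awk -v limit="${1:-90}" '
    NR > 1 && $5 ~ /^[0-9]+%$/ {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= limit + 0) print $6 "\t" $5
    }'
}

# Typical use:
#   df -i | full_inode_gauges 80
```

Fed the df -i output above, it prints /data with 100%. Empty output means no filesystem is over the threshold.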

Finding the culprit is a counting problem rather than a sizing problem, so du's byte totals will not point at it. Count files per directory instead — something like find /data -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -rn | head — or watch ncdu's item counts rather than its sizes (press c to display child counts per directory). The fix is to delete or archive the file swarm; the prevention is to notice that df -i exists and put it on the same dashboard as df -h. The two gauges fill independently, and either one at 100% takes the filesystem down for writes.
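The counting approach can be wrapped so it reports one line per immediate subdirectory. This is a sketch with a hypothetical name, inode_hogs; it counts every directory entry rather than only regular files, on the assumption that directories and symlinks consume inodes as well, and it counts a hard-linked file once per name.

```shell
# inode_hogs DIR
# Count directory entries (files, directories, symlinks) under each
# immediate subdirectory of DIR, busiest first. Each count includes the
# subdirectory itself. Names containing newlines would be miscounted.
inode_hogs() (
  find "$1" -mindepth 1 -maxdepth 1 -xdev -type d |
    while IFS= read -r sub; do
      count=$(find "$sub" -xdev | wc -l)
      printf '%d\t%s\n' "$count" "$sub"
    done | sort -rn
)
```

Run inode_hogs /data, then run it again on whichever subdirectory tops the list, the same way you would descend with du. On a filesystem that is already out of inodes this still works, because it creates no files.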

The same filesystem, two independent gauges. Blocks at 40%, inodes at 100% — and ENOSPC on every file creation. df -i is the only standard tool that shows the bottom bar.

The runaway directory, found in minutes

The boring case is the common one: nothing is hidden, nothing is exhausted, a directory just grew. A debug flag left on after an incident has the application logging at trace level. A build cache nobody set a size limit on. A database's write-ahead logs piling up because the thing that consumes them stopped. Here df and du agree, and the job is pure descent: find the heaviest directory, then the heaviest directory inside it, until you hit something with a name you recognise.

You can do that with the du | sort -rh loop, and on a machine where you cannot install anything, you should — it is four or five invocations to get anywhere on a deep tree. ncdu -x on the suspect filesystem does the same walk once and then makes the descent free: enter, enter, enter, and twenty seconds after the scan finishes you are looking at /var/lib/app/cache/render/ holding 400 GB of files last touched eight months ago. The walk itself costs the same as du — it is the same work, stat everything — but you only pay it once instead of once per level. On a multi-terabyte filesystem with tens of millions of files, that walk can take real minutes either way, which is an argument for running it in tmux and an argument for the export trick: ncdu -x -o scan.json /data saves the scan to a file, and ncdu -f scan.json reopens it instantly, on this machine or your laptop.

What you do once you find the 400 GB is judgement, not tooling — confirm nothing has it open, archive rather than delete if there is any doubt, and fix whatever let it grow unbounded. The triage pattern is the part to keep: df -h to pick the filesystem, one reconciliation glance at du's total versus df's Used, then ncdu to descend. Three tools, one minute each, and the question "where did the disk space go" is answered before the next page fires.

Why the numbers diverge

Everything above falls out of two different system calls. df calls statfs() on each mount point. The filesystem answers from counters it keeps in its superblock: total blocks, free blocks, free inodes. No tree is walked; the cost is the same whether the filesystem holds ten files or ten million, which is why df returns instantly on any disk. The list of mounts it iterates comes from /proc/self/mountinfo — the same place the mount command reads, and one more entry in the long list of things exposed through /proc.
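You can look at those counters without df's formatting. GNU stat has a filesystem mode (-f) that prints the fields statfs() returns; the sketch below wraps it in two helpers whose names are invented for illustration.

```shell
# statfs_counters PATH
# Print the raw statfs() counters via GNU stat -f:
#   %S block size, %b total blocks, %f free blocks,
#   %a blocks available to non-root, %c total inodes, %d free inodes
statfs_counters() {
  stat -f -c 'bsize=%S blocks=%b bfree=%f bavail=%a inodes=%c ifree=%d' "$1"
}

# reserved_bytes PATH
# Free blocks that only root may use: (bfree - bavail) * block size.
reserved_bytes() (
  set -- $(stat -f -c '%S %f %a' "$1")
  echo $(( $1 * ($2 - $3) ))
)
```

On an ext4 root filesystem, reserved_bytes / should come out near 5% of the size unless someone has changed the reservation. Filesystems without a reservation, tmpfs for example, typically report 0.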

du calls stat() on every file the walk reaches and sums the st_blocks field — the number of 512-byte blocks actually allocated to the file. It deduplicates hard links as it goes, so a file with three names on the same filesystem is counted once. Its total is built name by name, and that is the whole story of its blind spots: anything a name-walk cannot reach does not exist for du. Deleted-but-open files are the famous case. The sneakier one is the mount shadow.

When you mount a filesystem on a directory, the directory's previous contents do not move and are not deleted — they become unreachable. Imagine a service that writes to /srv/data before its data volume is mounted: maybe the mount failed on one boot, maybe an init-order race let the service start first. Eighty gigabytes land on the root filesystem, in /srv/data. Then the volume mounts, the new filesystem covers the directory, and those eighty gigabytes vanish from view. du walking through /srv/data now descends into the mounted filesystem and never sees the files underneath. df on the root filesystem still counts every block. You get the classic gap, lsof +L1 comes back empty because nothing is deleted and nothing is open, and the space sits in plain sight behind the mount. The way to look is to bind-mount the parent somewhere else — mount --bind / /mnt/peek gives you a view of the root filesystem with no child mounts attached, and du -sh /mnt/peek/srv/data reads the shadowed bytes directly. The drill at the end builds this exact situation in /tmp so you can watch it happen.
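Every mount point beneath the filesystem you are investigating is a candidate shadow, so a list of them tells you where to peek. The sketch below reads /proc/self/mountinfo, where field 5 is the mount point; mounts_under is a hypothetical name, the optional second argument exists only so the function can be tried on a saved copy, and mounts nested inside other mounts are listed too.

```shell
# mounts_under PARENT [MOUNTINFO_FILE]
# List mount points strictly below PARENT. Each one covers a directory
# whose earlier contents du can no longer reach by name.
mounts_under() (
  parent=${1%/}        # "/" becomes "", "/srv/" becomes "/srv"
  awk -v p="$parent" '
    $5 != "/" && index($5, p "/") == 1 { print $5 }
  ' "${2:-/proc/self/mountinfo}"
)
```

For each path that mounts_under / prints, the same path under a non-recursive bind mount of the root filesystem shows what is sitting underneath. Most will be empty directories; the one that is not is your missing space.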

What each tool can see on one filesystem. df counts every allocated block, named or not; du counts only what it can reach by name — and without -x it also counts things that are not on this disk at all.

One more source of divergence lives inside individual files. A file has two sizes: its apparent size (the length, st_size — how many bytes you would get reading it end to end) and its disk usage (allocated blocks). A sparse file pushes these wildly apart: write one byte at offset ten gigabytes and the filesystem stores one block plus a note that the rest reads as zeros. ls -l shows ten gigabytes; du shows a few kilobytes; both are correct. Disk images, core dumps, and database preallocations are sparse all the time. du --apparent-size switches to summing lengths, which is the number you want when asking "how big would this be if I tar'd it up" and the wrong number when asking "what is filling this disk." Filesystems with transparent compression bend the same two numbers the other way — more on that in the pitfalls. The block-and-inode machinery underneath all of this is covered in file systems, and you can watch inodes, blocks, and directory entries get allocated by hand in the filesystem simulator.
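Both sizes come from the same stat() call, and GNU stat can print them side by side. The helper names below are invented for this sketch; %s is st_size, %b is st_blocks, and %B is the unit st_blocks is reported in.

```shell
# apparent_bytes FILE: the length of the file (st_size).
apparent_bytes() {
  stat -c '%s' "$1"
}

# allocated_bytes FILE: blocks actually allocated (st_blocks) times the
# unit stat reports them in (%B, 512 on Linux).
allocated_bytes() (
  set -- $(stat -c '%b %B' "$1")
  echo $(( $1 * $2 ))
)
```

For a file made with truncate -s 10G, apparent_bytes prints 10737418240 while allocated_bytes prints a number close to zero. For a 100-byte text file the relationship flips, with allocated_bytes usually reporting one full filesystem block.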

The mount shadow. Nothing is deleted and nothing is open, so the usual ghost-file hunt comes back empty — the blocks are simply behind a door that the mount closed.

Pitfalls

Running du without -x and trusting the total. A du rooted at / without -x crosses into every mounted filesystem it meets, and its grand total describes the union of all of them. You then compare that number against df for one filesystem and conclude something impossible is happening. Worse, the walk descends into places that are not disks at all and stats things that misbehave when stat'd. If a number is going to drive a decision, it should describe one filesystem, and -x is what makes that true.

Pointing du or ncdu at an NFS tree. The tree walk issues a metadata request per file, and over NFS each one is a network round trip. On a big export that is millions of round trips: slow for you and a genuine load spike for the file server, which everyone else is using too. If the server has a hung peer or a flaky path, the walk can block indefinitely on a single stat with no timeout you control. Measure usage of a network filesystem from the server side where the walk is local, or with the server's own accounting (a quota report answers in milliseconds what a client-side du answers in an hour).

Forgetting the root reservation cuts both ways. Use% at 100 while Size minus Used still shows gigabytes free confuses people in one direction; the other direction bites harder. Your monitoring fires at 90%, someone "fixes" the alert by noting the disk has 5% nobody can use and reclaiming it with tune2fs -m 0, and the next time the disk fills, it fills completely, root included. Now the cleanup tooling cannot write a temp file, the package manager cannot run, and a full disk has been upgraded to an unrecoverable shell. Keep at least a small reservation on any filesystem the OS depends on.

Expecting honest numbers from btrfs and ZFS. Both report through statfs() because they must, but the question barely fits. Snapshots share extents, so "how much would deleting this free" has no single answer; compression means bytes written and blocks allocated diverge per file; on ZFS, pool space is shared by every dataset, so the same free blocks show up in several mounts' Avail at once, and on btrfs, metadata is allocated in chunks that can exhaust separately from data — ENOSPC with df showing free space, this time with inodes innocent too. Use the native tools: btrfs filesystem usage and zfs list -o space answer the questions df is mis-asking.

Reading du's number as "the size of the data." du reports allocated blocks, which tracks what the disk loses, not what the bytes weigh. Sparse files deflate it, compression deflates it, and small files inflate it (a 100-byte file occupies a full block). Reflinked copies on modern filesystems distort it in a different way: du charges each copy for the extents they share, so its total can exceed what the disk actually gave up. When you are estimating a transfer, a backup, or an S3 bill, use --apparent-size; when you are freeing a disk, use the default. Mixing them up produces estimates that are wrong by integer factors, in whichever direction is most embarrassing.

A drill you can run right now

Everything below is safe on any Linux machine: it reads state, creates a small scratch directory in /tmp, and mounts nothing over anything that matters. The bind-mount step needs sudo; everything else does not. Fifteen minutes, and the block gauge, the inode gauge, and the mount shadow stop being trivia.

Step 1 — both gauges, every filesystem. Run df -h and read it properly for once: find your root filesystem, check whether Used plus Avail equals Size, and compute the gap — that is your root reservation. Then run df -i and look at IUse% for the same filesystem. Most people have never once looked at this column on a machine they operate; note which of your filesystems is closest to inode exhaustion, because it is probably not the one closest to block exhaustion.

Step 2 — survey your own home. Run du -xh --max-depth=1 ~ | sort -rh | head. Descend once or twice into the biggest entry by hand, then do the same survey with ncdu -rx ~ (read-only, one filesystem) and feel the difference: the walk happens once and the descent is free. While you are in there, press c to show item counts and find your own file swarms — caches and package directories with six-digit counts are normal and worth knowing about.

Step 3 — build a mount shadow and catch it. Create a file, hide it behind a bind mount, and watch du lose it while the blocks stay allocated:

$ mkdir -p /tmp/shadow/under /tmp/shadow/cover
$ dd if=/dev/zero of=/tmp/shadow/under/hidden.bin bs=1M count=64 status=none
$ du -sh /tmp/shadow
65M	/tmp/shadow
$ df -h /tmp | tail -1
tmpfs            16G  66M   16G   1% /tmp
$ sudo mount --bind /tmp/shadow/cover /tmp/shadow/under
$ ls /tmp/shadow/under
$ du -sh /tmp/shadow
8.0K	/tmp/shadow
$ df -h /tmp | tail -1
tmpfs            16G  66M   16G   1% /tmp
$ sudo umount /tmp/shadow/under
$ ls /tmp/shadow/under
hidden.bin
$ rm -r /tmp/shadow

Walk through what just happened. Before the mount, du and df agreed: 64 MB of file, 64-ish MB of blocks. The bind mount placed an empty directory over under, and du's answer collapsed to nearly nothing — the walk now descends into the covering directory and hidden.bin has no reachable name. But df still reports the 66 MB, because the blocks never stopped being allocated. Nothing was deleted, nothing holds the file open, so the ghost-file hunt from the first scenario would come back empty; this gap has a different cause and a different cure. Unmount, and the file is back, untouched. That is the entire mount-shadow story, performed on 64 harmless megabytes in /tmp instead of 80 mystery gigabytes on a production root volume.

Step 4 — see a sparse file disagree with itself. No sudo needed: truncate -s 10G /tmp/sparse.img creates a ten-gigabyte file in zero time. Now ask its size both ways: du -h /tmp/sparse.img says 0, and du -h --apparent-size /tmp/sparse.img says 10G. ls -lh agrees with the second; df agrees with the first. Both are telling the truth about different things, which by this point in the page should feel familiar. Delete it with rm /tmp/sparse.img and you are done.

If you remember one sequence. df -h to pick the filesystem, df -i to rule out inodes, sudo du -xh --max-depth=1 MOUNT | sort -rh | head to reconcile and descend — and when df and du disagree, stop descending and ask what has no name: a deleted-but-open file, or a mount shadow.
