ulimit & limits

The service died at peak traffic with accept4: too many open files, and someone in the channel says "just raise the ulimit." That is where the trouble starts, because there is no single ulimit. The number a process lives under can be set in five different places: your shell, PAM, systemd, the container runtime, and the kernel itself, and they do not consult each other. This page shows you how to read the limit that actually applied, trace where it came from, change it on a process that is already running, and understand why editing limits.conf so often does exactly nothing.

The question it answers

Every process on Linux carries a private set of resource limits, the rlimits. There is one for open file descriptors (RLIMIT_NOFILE), one for the number of processes and threads the same user may have (RLIMIT_NPROC), one for locked memory (RLIMIT_MEMLOCK), one for core dump size, stack size, CPU seconds, and a handful more. Each limit is a pair of numbers, a soft value and a hard value, and the kernel checks the soft one at the moment a process tries to cross it. Try to open file descriptor number 1024 when the soft nofile limit is 1024 and open() or accept4() returns EMFILE, which your runtime prints as "too many open files." Try to fork past the nproc limit and fork() returns EAGAIN, which the shell prints as "resource temporarily unavailable."

The error message names the resource. It never names the place the number came from, and that is the question that actually matters during an incident. A limit can be set by the kernel's built-in default, by PID 1 when it started the world, by a line in a systemd unit file, by PAM at login, by a ulimit command in a shell or a wrapper script, by a container runtime, or by the program itself calling setrlimit(). Whichever one of those touched the value last before your process started is the one that won, because limits are inherited down the process tree and then frozen unless somebody changes them deliberately. So "raise the ulimit" is not one action. It is five possible actions, and four of them will have no effect on the process that is actually failing.

The good news is that the diagnostic side is simple once you know the one lookup most people never learn: the kernel will show you the exact limits of any running process, soft and hard, in a file under /proc. Start from there, work backwards to where the value was set, and the fix usually writes itself. The rest of this page is that path: the five lookups, how to read the output, three incidents where a limit was the villain, and what the inheritance machinery looks like underneath.

Soft, hard, and who may move what

Before the lookups, the two-number scheme deserves a proper explanation, because half of all limit confusion is people treating "the ulimit" as one value. The soft limit is the number the kernel enforces. It is the fence your process actually runs into. The hard limit is the ceiling on the soft limit: a process may move its own soft limit anywhere it likes, up or down, as long as it stays at or below the hard limit, and it needs no privilege to do so. Raising the hard limit is the privileged operation; it requires root, or more precisely the CAP_SYS_RESOURCE capability. A process may also lower its own hard limit without privilege, and that move is one-way: once lowered, an unprivileged process cannot get it back.

Two fences, two rules. Enforcement happens at the soft fence; the hard fence only bounds how far the soft one can move; fs.nr_open bounds the hard fence itself.
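The two rules are easy to watch in bash. What follows is a minimal sketch, not a canonical recipe: the function name soft_hard_demo and the values 128 and 256 are made up for illustration, and it assumes the hard nofile limit you inherited is at least 256. Everything runs in a subshell, so the calling shell keeps its own table.

```bash
# Sketch: exercise the soft/hard rules inside a subshell.
# 128 and 256 are arbitrary example values.
soft_hard_demo() (
    ulimit -Sn 128                 # soft may sit anywhere at or below hard
    ulimit -Hn 256                 # lowering hard: allowed, but one-way
    echo "lowered: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
    ulimit -Sn "$(ulimit -Hn)"     # raising soft up to hard needs no privilege
    echo "raised:  soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
)
soft_hard_demo
echo "caller:  soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

The order of the first two commands matters: the soft value has to come down first, because the kernel rejects any request that would leave the soft limit above the hard one.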

This split exists so that defaults can be conservative without being a prison. The classic example is nofile itself: systemd hands services a soft limit of 1024 and a hard limit of 524288 on purpose. The 1024 is not laziness. The old select() system call cannot represent descriptors numbered 1024 or higher, so a program that gets handed fd 2000 and passes it to select() corrupts memory. Keeping the soft default at 1024 protects every legacy program, while any modern server that uses epoll is expected to raise its own soft limit to the hard one at startup with a single setrlimit() call. Well-behaved servers do. The ones that page you do not.
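A server does this from inside with setrlimit(). For a program that does not, a launcher script can do the equivalent just before exec. The sketch below is one hedged way to write it: raise_nofile is a hypothetical helper name, and the 65536 default cap is an arbitrary assumption, chosen so the soft limit never chases an enormous hard limit.

```bash
# Sketch: move the soft nofile limit outwards, never past the hard limit
# and never past a chosen cap. Prints the resulting soft limit.
raise_nofile() {
    local cap=${1:-65536} soft hard target
    soft=$(ulimit -Sn)
    hard=$(ulimit -Hn)
    target=$cap
    # Never ask for more than the hard limit allows.
    if [ "$hard" != unlimited ] && [ "$hard" -lt "$cap" ]; then
        target=$hard
    fi
    # Only ever move the fence outwards; leave a higher soft limit alone.
    if [ "$soft" = unlimited ] || [ "$soft" -ge "$target" ]; then
        echo "$soft"
        return 0
    fi
    ulimit -Sn "$target" && ulimit -Sn
}

# Demonstrated in a subshell; a real launcher would call it and then: exec "$@"
( raise_nofile 65536 )
```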

One more fence sits past the hard limit. The sysctl fs.nr_open (1048576 as the kernel default, though recent systemd versions typically raise it at boot) is the maximum value any process's hard nofile limit can be raised to, even by root. And separate from all the per-process numbers, fs.file-max caps the total number of open file descriptions across the whole machine. Run out of that pool and you get a different errno, ENFILE, and a kernel log line. Keep the two errors distinct in your head: EMFILE means this process hit its own fence, ENFILE means the machine as a whole did.
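All three fences can be read side by side without any privilege. A small sketch, with fd_fences as an illustrative name; it relies on /proc/sys/fs/file-nr holding three numbers (allocated, unused, maximum), the last of which is fs.file-max.

```bash
# Sketch: print the kernel-level descriptor fences next to this shell's own.
fd_fences() {
    local allocated unused max
    read -r allocated unused max < /proc/sys/fs/file-nr
    echo "per-process soft/hard: $(ulimit -Sn)/$(ulimit -Hn)"
    echo "fs.nr_open (ceiling on any hard limit): $(cat /proc/sys/fs/nr_open)"
    echo "fs.file-max (machine-wide pool): $max, allocated now: $allocated"
}
fd_fences
```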

The five lookups

Five commands cover everything. Each answers a slightly different question, and reaching for the wrong one is how people convince themselves a limit is fine when it is not.

Lookup	What it shows	When it is the right one
`ulimit -n`, `ulimit -a`	The current shell's soft limits; add `-H` for the hard ones	Sanity-checking what a command launched from this shell will inherit. Says nothing about any other process.
`cat /proc/PID/limits`	The live soft and hard limits of a running process, straight from the kernel	Always, for anything already running. This is the truth, and the lookup most engineers never learn.
`prlimit --pid PID`	Reads the same table, and with `--nofile=soft:hard` changes it in place	Raising a limit on a live process during an incident, without a restart.
`systemctl show svc -p LimitNOFILE`	What systemd will hand the service at its next start	Any daemon under systemd. `limits.conf` does not apply to these, ever.
`sysctl fs.file-max fs.nr_open`	The machine-wide descriptor pool, and the ceiling on any per-process hard limit	When the error is `ENFILE`, or a huge nofile value refuses to take.

The first one is the famous one and the least useful. ulimit is a shell builtin, not a program; it reads and writes the limits of the shell process itself, which children then inherit. ulimit -a prints the full set, ulimit -n prints the soft nofile limit, and ulimit -Hn prints the hard one. The output describes exactly one process: your shell. The daemon that is failing in production was almost certainly not started from a shell, so this number is usually a red herring, and an entire generation of runbooks that say "check ulimit -n" are checking the wrong process.

The second is the one to internalise. /proc/PID/limits is a kernel-rendered view of the failing process's own limit table. No inference, no guessing what it inherited; the numbers in that file are the numbers the kernel is enforcing on it right now. When the pager fires, your first move is cat /proc/PID/limits on the process that threw the error, and your second move is comparing what you find against what each layer claims it set. The annotated read-through is in the next section.

The third, prlimit, is the same table with write access. Without arguments it prints a process's limits in a tidy table; with --pid 41327 --nofile=65536:524288 it changes the soft and hard nofile values of PID 41327 while it runs, no restart, taking effect on the very next open(). Changing the limits of a process that belongs to another user, or raising any hard limit, needs root (strictly, CAP_SYS_RESOURCE); adjusting your own process at or below its hard limit does not. This is the incident stopgap: it buys you hours to do the unit-file fix properly instead of restarting a wounded service at peak.

The fourth is for services. If the process was started by systemd, its limits came from the unit file's LimitNOFILE= (and siblings like LimitNPROC= and LimitMEMLOCK=), falling back to DefaultLimitNOFILE= in /etc/systemd/system.conf, falling back to systemd's compiled defaults. systemctl show myservice -p LimitNOFILE -p LimitNOFILESoft prints what the next start will get. The unit-file mechanics, drop-in overrides and daemon-reload included, are covered in systemctl.

The fifth is the kernel layer, and it is almost never your problem on a modern machine, because systemd raises fs.file-max to an effectively unreachable value at boot. It earns its place in the table for two reasons: ENFILE still happens on older or hand-tuned systems, and fs.nr_open is the silent reason a LimitNOFILE=infinity or an enormous setrlimit() request gets refused or clamped. The kernel will not let any hard nofile limit exceed it.

Reading /proc/PID/limits

Here is the file for a Java service that is about to have a bad day. The rows that matter are Max processes, Max open files, and Max locked memory. Every process has one of these; substitute any PID, or self for the reading process.

$ cat /proc/41327/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max processes             63990                63990                processes
Max open files            1024                 524288               files
Max locked memory         8388608              8388608              bytes
Max address space         unlimited            unlimited            bytes
Max pending signals       63990                63990                signals
Max nice priority         0                    0
Max realtime priority     0                    0

Read it column by column. The first column names the resource in human words rather than the RLIMIT_* constant; "Max open files" is RLIMIT_NOFILE, "Max processes" is RLIMIT_NPROC, "Max locked memory" is RLIMIT_MEMLOCK. The second and third columns are the soft and hard values, in the units of the last column. Now the three rows that matter. Max open files reads 1024 soft against 524288 hard: the classic systemd service default, and the signature of a server that never raised its own soft limit. There is a fence 512 times further out that nobody moved this process up to. Max processes is the fork-and-thread budget, and note that it is counted against the user, not the process: the kernel compares the total number of tasks belonging to this real UID across the whole machine. One process leaking threads can therefore starve every other process the same user owns. Max locked memory is in bytes, and 8388608 is 8 MiB, the modern default; older systems show 65536, a number with consequences we will get to.

Two smaller observations. "Max core file size 0" soft against unlimited hard is the standard arrangement for core dumps: disabled by default, but any process may opt in by raising its own soft limit, which is exactly the soft/hard design doing its job. And "unlimited" in this file is a real value (RLIM_INFINITY), not an absence; for most resources it means what it says, with one important exception for nofile that lives in the pitfalls.
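For scripts and runbooks it helps to pull a single row out of that file. A minimal sketch with awk follows; proc_limit is a hypothetical helper name, and it expects the row label exactly as the kernel prints it in the first column.

```bash
# Sketch: print "soft hard" for one row of /proc/PID/limits.
# Usage: proc_limit PID 'Max open files'
proc_limit() {
    local pid=$1 name=$2
    awk -v name="$name" '
        index($0, name) == 1 {                    # row starts with the label
            rest = substr($0, length(name) + 1)   # drop the label itself
            split(rest, f, " ")                   # soft, hard, units
            print f[1], f[2]
        }' "/proc/$pid/limits"
}
proc_limit self 'Max open files'
```

Matching on the full label matters because the labels contain spaces and share prefixes: "Max file size" and "Max file locks" both begin with "Max file".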

When the limit is hit, this is what the failure looks like from underneath, courtesy of strace. The error anatomy is worth recognising on sight:

# the per-process fence: soft RLIMIT_NOFILE exhausted
accept4(7, ..., SOCK_CLOEXEC)  = -1 EMFILE (Too many open files)

# the machine-wide pool: fs.file-max exhausted (rare on modern boxes)
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = -1 ENFILE (Too many open files in system)
# dmesg says: VFS: file-max limit 1000000 reached

# the user-wide task budget: RLIMIT_NPROC exhausted
clone(...)                     = -1 EAGAIN (Resource temporarily unavailable)

Three different errors, three different fences, one shared symptom of "the service stopped working." The English strings are close enough to confuse a tired engineer; the errno names are not. EMFILE sends you to /proc/PID/limits and then to lsof to find out what all those descriptors are. ENFILE sends you to sysctl fs.file-max and a hunt for whichever process is eating the global pool. EAGAIN from fork or clone sends you counting tasks, not files.
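For the EMFILE branch, a quick first measurement is how many descriptors the process holds against its soft fence. A rough sketch, with fd_headroom as an illustrative name. It needs permission to list /proc/PID/fd, which in practice means the same user or root, and it compares a count with a limit that the kernel actually applies to descriptor numbers, so treat the result as an estimate.

```bash
# Sketch: how close is a process to its soft nofile fence?
fd_headroom() {
    local pid=$1 used soft
    used=$(ls "/proc/$pid/fd" | wc -l)
    soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "pid=$pid open=$used soft=$soft"
}
fd_headroom $$
```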

Three production scenarios

The proxy that drowned at 1024

A reverse proxy has run for months. Marketing buys a spot during something televised, traffic triples, and the proxy starts refusing connections with EMFILE in its logs. A proxy holds two descriptors per connection, one to the client and one upstream, plus listeners, logs, and resolver sockets, so roughly five hundred concurrent connections are enough to walk straight into a 1024 fence. The interesting work is not raising the number, it is tracing where the number came from, because the fix has to land at the layer that set it:

$ grep 'open files' /proc/9131/limits
Max open files            1024                 524288               files
$ systemctl show proxy.service -p LimitNOFILE -p LimitNOFILESoft
LimitNOFILE=524288
LimitNOFILESoft=1024
# stopgap, takes effect immediately, survives nothing:
$ sudo prlimit --pid 9131 --nofile=65536:524288
# real fix, survives restarts:
$ sudo systemctl edit proxy.service   # add: [Service] LimitNOFILE=65536:524288
$ sudo systemctl daemon-reload && sudo systemctl restart proxy.service

Walk the chain. Was the process started from a shell? No, so every ulimit in every dotfile is irrelevant. Was it started by systemd? Yes, so its limits are exactly what the unit gave it, and systemctl show confirms the soft value is the inherited default, untouched by the unit file. Is it in a container? Then there is one more layer: the runtime sets limits on the container's init process and everything inside inherits from there, so the same investigation moves inside with docker exec proxy cat /proc/1/limits, and the fix becomes --ulimit nofile=65536:524288 on the run command or the daemon's default-ulimits. Same fences, one more place they can be planted. The prlimit stopgap is genuinely useful here: it unbreaks production in two seconds, on the live process, while the unit-file change waits for a calm moment to restart.
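Walking the chain can be partly mechanised by printing the nofile pair of every ancestor up to PID 1: the point where the numbers change is usually the layer that set them. This is a sketch under assumptions, with limit_lineage as a made-up name. It shows each ancestor's limits as they are now, which may differ from what they were at fork time, and a reparented process will skip straight to its adopter.

```bash
# Sketch: walk from a process up to PID 1, printing each ancestor's
# soft/hard nofile pair as the kernel currently reports it.
limit_lineage() {
    local pid=${1:-$$} comm pair
    while [ "$pid" -ge 1 ] && [ -r "/proc/$pid/limits" ]; do
        comm=$(cat "/proc/$pid/comm")
        pair=$(awk '/^Max open files/ { print $4 "/" $5 }' "/proc/$pid/limits")
        printf '%-8s %-16s nofile=%s\n' "$pid" "$comm" "$pair"
        [ "$pid" -eq 1 ] && break
        # Parent PID; 0 means the parent lives outside this PID namespace.
        pid=$(awk '/^PPid:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
        pid=${pid:-0}
    done
}
limit_lineage $$
```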

fork: retry: Resource temporarily unavailable

A deploy user's cron jobs start failing, and anyone who tries sudo -u deploy bash gets fork: retry: Resource temporarily unavailable. Nothing is out of memory. This is RLIMIT_NPROC, and the detail that cracks the case is the one from the annotated output above: the count is per user, across every process that UID owns, and threads count as tasks. The usual culprit is not ten thousand processes; it is one process with ten thousand leaked threads:

$ ps -L -u deploy --no-headers | wc -l
63988
$ ps -o pid,nlwp,comm -u deploy --sort=-nlwp | head -3
    PID  NLWP COMMAND
  55012 63871 sync-worker      # one PID owns nearly all of them: a thread leak
  55400     9 gunicorn
$ grep processes /proc/55012/limits
Max processes             63990                63990                processes

63988 tasks against a fence of 63990: every new thread or fork by this user fails, including the innocent cron jobs and your debugging shell, while the leaking process itself sits there healthy because it is not trying to spawn anything. Restarting the leaker frees the budget instantly; the durable fix is in its code. Two adjacent fences are worth knowing while you are here. systemd also enforces TasksMax= per service through the cgroup pids controller, which produces the same EAGAIN with a different ceiling (systemctl show svc -p TasksMax reveals it), and it is not an rlimit at all. And the per-user task accounting is one of the reasons running everything as one shared UID on a box is a quiet act of self-sabotage: one tenant's leak takes out every tenant's fork.
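The same count is available without ps, straight from /proc, which is handy on minimal images. A hedged sketch: task_budget is an illustrative name, it counts only tasks owned by the user running it, and the ownership test on /proc entries is an approximation of the kernel's own per-UID accounting.

```bash
# Sketch: count the current user's tasks (processes plus threads)
# and print them next to this shell's soft RLIMIT_NPROC.
task_budget() {
    local t tasks=0
    for t in /proc/[0-9]*/task/[0-9]*; do
        # -O: the entry exists and is owned by our effective UID
        if [ -O "$t" ]; then
            tasks=$((tasks + 1))
        fi
    done
    echo "tasks=$tasks nproc_soft=$(ulimit -Su)"
}
task_budget
```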

The 64-kilobyte fence nobody remembers planting

A perf engineer ships a tool built on io_uring, or an SRE runs an eBPF tracer, and it dies on startup with ENOMEM or a permission error on a machine with two hundred free gigabytes. The fence here is RLIMIT_MEMLOCK, the cap on memory a process may lock so it can never be paged out. Ring buffers shared with the kernel have to be locked, registered io_uring buffers count against it, and on kernels before 5.11 every eBPF map and program was charged to it too, which is why older hosts are where this bites hardest. The historical default soft limit was 64 KiB, a number from an era when locking memory was exotic, and 64 KiB disappears the moment a modern tool sets up its first ring:

$ grep 'locked memory' /proc/61204/limits
Max locked memory         65536                65536                bytes
# io_uring_queue_init: Cannot allocate memory
# bpf(BPF_MAP_CREATE, ...) = -1 EPERM (kernels before 5.11)
$ sudo prlimit --pid 61204 --memlock=1073741824:1073741824   # live
# or in the unit file: LimitMEMLOCK=1G  (and restart)

Newer stacks have softened this from both ends: systemd raised its default to 8 MiB, and kernel 5.11 moved BPF accounting to the cgroup memory controller, so the same tracer that needed LimitMEMLOCK=infinity on an old host runs untouched on a new one. That inconsistency is itself the lesson. The error a tool throws depends on the kernel, the distro, and which init started it, so skip the folklore and read the process's own limits file first. It takes four seconds and it ends arguments.
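One unit trap sits on top of this fence: bash's ulimit -l reports locked memory in KiB, while /proc/PID/limits reports bytes, so the same 64 KiB fence reads 64 in one place and 65536 in the other. A small sketch that converts the builtin's answer so the two can be compared; memlock_soft_bytes is a hypothetical name.

```bash
# Sketch: this shell's soft RLIMIT_MEMLOCK in bytes, the unit /proc uses.
memlock_soft_bytes() {
    local kib
    kib=$(ulimit -Sl)              # bash reports this limit in KiB
    if [ "$kib" = unlimited ]; then
        echo unlimited
    else
        echo $((kib * 1024))
    fi
}
memlock_soft_bytes
grep 'locked memory' /proc/$$/limits
```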

What is underneath: inheritance, and the two roads from PID 1

All of the layer confusion dissolves once you see the mechanism. Resource limits are a small table inside the kernel's per-process state, sixteen pairs of numbers, nothing more. When a process calls fork(), the child gets a copy of the parent's table. When it calls execve() to become a different program, the table survives untouched. That is the entire propagation model: limits flow down the process tree by copying, at creation time, and no process ever re-reads any configuration file afterwards. There is no subscription, no refresh, no daemon watching limits.conf. A limit is wherever it is because some ancestor set it before the fork, or because something with enough privilege reached in later with prlimit(). The fork-and-exec machinery itself is the subject of processes; limits are simply one more thing it copies.
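The copy-at-fork, keep-across-exec model fits in a few lines of bash. A minimal sketch, assuming the inherited hard nofile limit is at least 200 (an arbitrary example value); inherit_demo is an illustrative name.

```bash
# Sketch: limits are copied at fork, survive exec, and never flow upwards.
inherit_demo() {
    local before
    before=$(ulimit -Sn)
    (
        ulimit -Sn 200                               # edit the subshell's copy
        bash -c 'echo "grandchild: $(ulimit -Sn)"'   # fork + exec keeps it
    )
    if [ "$(ulimit -Sn)" = "$before" ]; then
        echo "parent unchanged: yes"
    else
        echo "parent unchanged: no"
    fi
}
inherit_demo
```

The grandchild is a fresh bash started through execve(), and it still reports 200, while the function's own shell never sees the change.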

Why the daemon and your shell disagree about "the ulimit": they took different roads from PID 1, and each road has its own knobs.

Now the two roads. When systemd starts a service, PID 1 forks, sets the limits the unit demands, and execs the daemon. That is the whole story for the service path: there is no login, no PAM session, and therefore /etc/security/limits.conf is never opened. That file is not system configuration in any general sense. It is the input to one specific PAM module, pam_limits.so, which runs only when something opens a PAM session: an SSH login, a console login, su -, a cron job on most distributions. So limits.conf governs the limits of human sessions and the things humans launch, while unit files govern daemons, and the two never touch. This is the answer to the perennial mystery ticket: "I raised nofile in limits.conf, rebooted, and the service still has 1024." Of course it does. The service's road never passes that file.

The same model explains the other classic. "I edited limits.conf, why does my running daemon still show the old value?" would be unanswerable if limits were a live setting; it is obvious once you know they are copied at fork. The daemon's table was written once, at start, from whatever its parent had plus whatever its parent applied. Configuration edited afterwards is just bytes on disk that nothing will read until the next start. The only ways to change a running process's limits are from inside (the process calls setrlimit() on itself) or from outside with prlimit(), which is exactly what the prlimit tool wraps. Everything else is a restart wearing a disguise.

Pitfalls

Raising the soft limit and ignoring the hard one, or the reverse. The pair moves under different rules, and half-raised limits fail in confusing ways. Set LimitNOFILE=65536 in a unit and systemd sets both values to 65536, which silently lowers the hard limit from 524288 and takes headroom away from a process that would have raised itself. Use the two-value form, LimitNOFILE=8192:524288, when you mean different numbers. In a shell, plain ulimit -n 65536 sets both; and if you lower the hard limit in a shell experiment, that shell cannot undo it, because lowering a hard limit is the irreversible unprivileged move. New shell, fresh table.
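The shell half of this pitfall is easy to see for yourself. A sketch with an arbitrary value of 512, assuming the inherited hard limit is at least that; each case runs in its own subshell so the irreversible move is thrown away afterwards. The name both_vs_soft is made up.

```bash
# Sketch: `ulimit -Sn N` moves only the soft fence,
# plain `ulimit -n N` moves both.
both_vs_soft() {
    ( ulimit -Sn 512; echo "ulimit -Sn 512: soft=$(ulimit -Sn) hard=$(ulimit -Hn)" )
    ( ulimit -n 512;  echo "ulimit -n 512:  soft=$(ulimit -Sn) hard=$(ulimit -Hn)" )
}
both_vs_soft
```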

Forgetting that the container runtime is a fifth fence-setter. A containerised process inherits from the runtime, not from your host shell and not from the service manager inside the container image. Docker takes --ulimit nofile=4096:8192 per container and default-ulimits in the daemon config, and the runtime's own service unit feeds the chain above that. The cautionary tale is the containerd unit that shipped LimitNOFILE=infinity for years: containers inherited a hard limit around a billion, and programs that "close all possible descriptors before exec" by looping from 0 to the hard limit, a common daemonising idiom, sat burning CPU for minutes. Limits that are too high break things too, just more creatively.

Confusing fs.file-max with the per-process limit. They are different fences around different fields. Raising fs.file-max when a process throws EMFILE does nothing, because the process hit its own nofile fence, not the machine's pool. The reverse mistake also happens: cranking per-process limits to the moon on a box whose global pool is modest just moves the failure to a worse place, since ENFILE hits every process at once instead of the greedy one.

Believing "unlimited" means unlimited for nofile. For most resources RLIM_INFINITY is genuine. For open files it cannot be: the kernel requires the hard nofile limit to stay at or below fs.nr_open, and a setrlimit() asking for more is refused with EPERM, a confusing errno for "the number was too big." systemd papers over this by translating LimitNOFILE=infinity into the current fs.nr_open value, which is tidy but means "infinity" quietly equals whatever that sysctl holds: about a million with the kernel default, about a billion on hosts where systemd or an administrator has raised it. If a service needs a known descriptor budget, write the number.

Checking the shell when the patient is a daemon. Worth restating as a pitfall because it is the single most common limits mistake: ulimit -n in your SSH session describes your SSH session. The daemon took the other road. Check /proc/PID/limits, always, and treat any limits claim that is not backed by that file as a rumour.

A drill you can run right now

Everything below is safe on any Linux machine, shared ones included: it reads state, lowers a limit only inside throwaway child processes, and restores everything by letting those children exit. Ten minutes, and soft versus hard, the two roads, and the live-change trick stop being theory.

Step 1 — read your own table, twice. Run ulimit -a and read every line; this is the soft column. Run ulimit -Hn and compare it to ulimit -n; on most distros you will find the 1024-against-a-much-bigger-number split this page keeps returning to. Then get the same answer from the kernel with cat /proc/$$/limits (the shell expands $$ to its own PID) and confirm the builtin and the file agree. They must: they are two views of one table.

Step 2 — compare your shell to a daemon. Pick any running service and put the two roads side by side: systemctl show sshd -p MainPID -p LimitNOFILESoft, then cat /proc/THATPID/limits (root may be needed to read another user's process). Notice what matches and what does not against your own shell's table: the daemon's numbers came from a unit file and PID 1, yours came through PAM and limits.conf, and nothing forces them to agree.

Step 3 — hit the fence on purpose. Lower the soft nofile limit inside a child shell and watch an ordinary command die, while your own shell stays untouched:

$ bash -c 'ulimit -n 3; cat /etc/hostname'
cat: /etc/hostname: Too many open files
$ ulimit -n
1024

Sit with why that failed. A nofile limit of 3 permits descriptors 0, 1, and 2, which the child shell already holds as stdin, stdout, and stderr. cat needs descriptor 3 to open the file, the kernel refuses with EMFILE, and you have just reproduced the production error with a one-line command. On many systems a dynamically linked cat hits the same fence one step earlier, when the loader tries to open libc, and the message becomes "error while loading shared libraries" with Error 24; errno 24 is EMFILE, so it is the same refusal. The second command shows the other half of the lesson: the parent shell still reads 1024, because the child got a copy of the table and its changes died with it. Inheritance is copying; copying is one-way.
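A variant of the same experiment that does not depend on how cat was built is to let the shell itself perform the open, through a redirection on a builtin. This is a sketch, with emfile_demo as a made-up name; the exact prefix of the error line varies between bash versions, so only its tail is worth relying on.

```bash
# Sketch: make the child shell itself hit EMFILE. With a limit of 3 and
# descriptors 0, 1 and 2 occupied, the redirection needs descriptor 3.
emfile_demo() {
    # stdin is pinned to /dev/null so descriptor 0 is certainly in use
    LC_ALL=C bash -c 'ulimit -n 3; : < /dev/null' < /dev/null
}
emfile_demo
echo "child exit status: $?"
```

The child prints a line ending in "Too many open files" on stderr and exits non-zero, while the shell that launched it carries on with its own limits.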

Step 4 — change a live process with prlimit. Start a victim, squeeze it, inspect it, release it:

$ sleep 600 &
[1] 70233
$ prlimit --pid 70233 --nofile=64:1024
$ grep 'open files' /proc/70233/limits
Max open files            64                   1024                 files
$ prlimit --pid 70233 --nofile=1024:1024
$ kill %1

No restart, no signal, no cooperation from the process: the table changed while sleep ran, and /proc showed the change instantly. You could do this unprivileged because the process is yours and the values stayed at or under the hard limit. During a real incident the same two commands, with sudo and bigger numbers, are the difference between "fixed in ten seconds" and "restarted the only healthy instance."

If you remember one line. cat /proc/PID/limits is the truth for any running process; prlimit --pid PID --nofile=S:H changes it live; and systemd services take their limits from the unit file, never from limits.conf.
