ulimit & limits
The service died at peak traffic with accept4: too many open files, and someone
in the channel says "just raise the ulimit." That is where the trouble starts, because there
is no single ulimit. The number a process lives under can be set in five different places:
your shell, PAM, systemd, the container runtime, and the kernel itself, and they do not
consult each other. This page shows you how to read the limit that actually applied, trace
where it came from, change it on a process that is already running, and understand why
editing limits.conf so often does exactly nothing.
The question it answers
Every process on Linux carries a private set of resource limits, the rlimits. There is one
for open file descriptors (RLIMIT_NOFILE), one for the number of processes and
threads the same user may have (RLIMIT_NPROC), one for locked memory
(RLIMIT_MEMLOCK), one for core dump size, stack size, CPU seconds, and a handful
more. Each limit is a pair of numbers, a soft value and a hard value, and the kernel checks
the soft one at the moment a process tries to cross it. Try to open file descriptor number
1024 when the soft nofile limit is 1024 and open() or accept4()
returns EMFILE, which your runtime prints as "too many open files." Try to
fork past the nproc limit and fork() returns EAGAIN, which the
shell prints as "resource temporarily unavailable."
The error message names the resource. It never names the place the number came from, and
that is the question that actually matters during an incident. A limit can be set by the
kernel's built-in default, by PID 1 when it started the world, by a line in a systemd unit
file, by PAM at login, by a ulimit command in a shell or a wrapper script, by a
container runtime, or by the program itself calling setrlimit(). Whichever one
of those touched the value last before your process started is the one that won, because
limits are inherited down the process tree and then frozen unless somebody changes them
deliberately. So "raise the ulimit" is not one action. It is five possible actions, and four
of them will have no effect on the process that is actually failing.
The good news is that the diagnostic side is simple once you know the one lookup most people never learn: the kernel will show you the exact limits of any running process, soft and hard, in a file under /proc. Start from there, work backwards to where the value was set, and the fix usually writes itself. The rest of this page is that path: the five lookups, how to read the output, three incidents where a limit was the villain, and what the inheritance machinery looks like underneath.
Soft, hard, and who may move what
Before the lookups, the two-number scheme deserves a proper explanation, because half of all
limit confusion is people treating "the ulimit" as one value. The soft limit
is the number the kernel enforces. It is the fence your process actually runs into. The
hard limit is the ceiling on the soft limit: a process may move its own soft
limit anywhere it likes, up or down, as long as it stays at or below the hard limit, and it
needs no privilege to do so. Raising the hard limit is the privileged operation; it
requires root, or more precisely the CAP_SYS_RESOURCE capability. A process may
also lower its own hard limit without privilege, and that move is one-way: once lowered, an
unprivileged process cannot get it back.
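The movement rules are easy to check from a bash prompt. What follows is a minimal sketch, not a reference implementation: the helper names are invented for this page, each helper runs in a subshell so the calling shell's limits are left alone, and it assumes the current hard nofile limit is at least 512.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Each body is a subshell (parentheses), so every
# limit change is discarded when the helper returns.

# The soft limit moves freely, down and back up, while it stays <= hard.
soft_moves_within_hard() (
    ulimit -n 512  || exit 1     # no -S/-H: bash sets soft and hard together
    ulimit -Sn 128 || exit 1     # soft down
    ulimit -Sn 512 || exit 1     # soft back up, as far as the hard limit
)

# A soft limit above the hard limit is rejected, privileged or not.
soft_above_hard_rejected() (
    ulimit -n 512 || exit 1
    ! ulimit -Sn 1024 2>/dev/null
)

soft_moves_within_hard   && echo "soft limit moved down and back up"
soft_above_hard_rejected && echo "soft limit above the hard limit was refused"
```

Raising the hard limit itself is deliberately not exercised here, because the outcome depends on whether the shell holds CAP_SYS_RESOURCE.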
This split exists so that defaults can be conservative without being a prison. The classic
example is nofile itself: systemd hands services a soft limit of 1024 and a hard limit of
524288 on purpose. The 1024 is not laziness. The old select() system call
cannot represent descriptors numbered 1024 or higher, so a program that gets handed fd 2000
and passes it to select() corrupts memory. Keeping the soft default at 1024
protects every legacy program, while any modern server that uses epoll is
expected to raise its own soft limit to the hard one at startup with a single
setrlimit() call. Well-behaved servers do. The ones that page you do not.
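For a program that never raises its own limit, a launcher script can make the equivalent move before it execs the real binary. This is a sketch under assumptions: the function name raise_nofile_soft is made up, and the commented exec line stands in for whatever server you actually start.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lift the soft nofile limit to the hard limit.
# No privilege is needed, because the soft limit may move anywhere at or
# below the hard limit.
raise_nofile_soft() {
    local hard
    hard=$(ulimit -Hn)
    ulimit -Sn "$hard"
}

# Run the demonstration in a subshell so the calling shell stays untouched.
(
    echo "before: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
    raise_nofile_soft
    echo "after:  soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
    # A real wrapper would now hand over to the server, for example:
    #   exec my-server "$@"     (my-server is a placeholder name)
)
```

Only do this for programs known not to use select(); the 1024 default exists to protect exactly those.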
One more fence sits past the hard limit. The sysctl fs.nr_open (one million and
change by default) is the maximum value any process's hard nofile limit can be raised to,
even by root. And separate from all the per-process numbers, fs.file-max caps
the total number of open file descriptions across the whole machine. Run out of that pool
and you get a different errno, ENFILE, and a kernel log line. Keep the two
errors distinct in your head: EMFILE means this process hit its own fence,
ENFILE means the machine as a whole did.
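To see how close the machine-wide pool is to its cap, the kernel exposes the counters in /proc/sys/fs/file-nr. The sketch below only reads that file; the function name file_pool is an invention for this page.

```shell
#!/usr/bin/env bash
# /proc/sys/fs/file-nr holds three fields: allocated file handles,
# allocated-but-unused handles, and the maximum (the fs.file-max value).
file_pool() {
    local allocated unused maximum
    read -r allocated unused maximum < /proc/sys/fs/file-nr
    echo "allocated=$allocated max=$maximum"
}

file_pool
```

If allocated sits close to max, ENFILE is plausible. If it is nowhere near, a "too many open files" error is almost certainly the per-process fence.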
The five lookups
Five commands cover everything. Each answers a slightly different question, and reaching for the wrong one is how people convince themselves a limit is fine when it is not.
| Lookup | What it shows | When it is the right one |
|---|---|---|
| ulimit -n, ulimit -a | The current shell's soft limits; add -H for the hard ones | Sanity-checking what a command launched from this shell will inherit. Says nothing about any other process. |
| cat /proc/PID/limits | The live soft and hard limits of a running process, straight from the kernel | Always, for anything already running. This is the truth, and the lookup most engineers never learn. |
| prlimit --pid PID | Reads the same table, and with --nofile=soft:hard changes it in place | Raising a limit on a live process during an incident, without a restart. |
| systemctl show svc -p LimitNOFILE | What systemd will hand the service at its next start | Any daemon under systemd. limits.conf does not apply to these, ever. |
| sysctl fs.file-max fs.nr_open | The machine-wide descriptor pool, and the ceiling on any per-process hard limit | When the error is ENFILE, or a huge nofile value refuses to take. |
The first one is the famous one and the least useful. ulimit is a shell builtin,
not a program; it reads and writes the limits of the shell process itself, which children
then inherit. ulimit -a prints the full set, ulimit -n prints the
soft nofile limit, and ulimit -Hn prints the hard one. The output describes
exactly one process: your shell. The daemon that is failing in production was almost
certainly not started from a shell, so this number is usually a red herring, and an
entire generation of runbooks that say "check ulimit -n" are checking the wrong process.
The second is the one to internalise. /proc/PID/limits is a kernel-rendered
view of the failing process's own limit table. No inference, no guessing what it inherited;
the numbers in that file are the numbers the kernel is enforcing on it right now. When the
pager fires, your first move is cat /proc/PID/limits on the process that threw
the error, and your second move is comparing what you find against what each layer claims
it set. The annotated read-through is in the next section.
The third, prlimit, is the same table with write access. Without arguments it
prints a process's limits in a tidy table; with --pid 41327 --nofile=65536:524288
it changes the soft and hard nofile values of PID 41327 while it runs, no restart, taking
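effect on the very next open(). Changing a process owned by another user, or raising any hard
limit, needs root (more precisely CAP_SYS_RESOURCE); adjusting the soft limit of your own process within its hard limit does not. This is the incident stopgap: it buys you hours to do the unit-file fix
properly instead of restarting a wounded service at peak.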
The fourth is for services. If the process was started by systemd, its limits came from the
unit file's LimitNOFILE= (and siblings like LimitNPROC= and
LimitMEMLOCK=), falling back to DefaultLimitNOFILE= in
/etc/systemd/system.conf, falling back to systemd's compiled defaults.
systemctl show myservice -p LimitNOFILE -p LimitNOFILESoft prints what the next
start will get. The unit-file mechanics, drop-in overrides and
daemon-reload included, are covered in the systemctl page.
The fifth is the kernel layer, and it is almost never your problem on a modern machine,
because systemd raises fs.file-max to an effectively unreachable value at boot.
It earns its place in the table for two reasons: ENFILE still happens on
older or hand-tuned systems, and fs.nr_open is the silent reason a
LimitNOFILE=infinity or an enormous setrlimit() request gets
refused or clamped. The kernel will not let any hard nofile limit exceed it.
Reading /proc/PID/limits
Here is the file for a Java service that is about to have a bad day, with the lines that
matter highlighted. Every process has one of these; substitute any PID, or self
for the reading process.
    $ cat /proc/41327/limits
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max processes             63990                63990                processes
    Max open files            1024                 524288               files
    Max locked memory         8388608              8388608              bytes
    Max address space         unlimited            unlimited            bytes
    Max pending signals       63990                63990                signals
    Max nice priority         0                    0
    Max realtime priority     0                    0
Read it column by column. The first column names the resource in human words rather than the
RLIMIT_* constant; "Max open files" is RLIMIT_NOFILE, "Max
processes" is RLIMIT_NPROC, "Max locked memory" is
RLIMIT_MEMLOCK. The second and third columns are the soft and hard values, in
the units of the last column. Now the highlighted rows. Max open files
reads 1024 soft against 524288 hard: the classic systemd service default, and the signature
of a server that never raised its own soft limit. There is a fence 512 times further out
that nobody moved this process up to. Max processes is the fork-and-thread
budget, and note that it is counted against the user, not the process: the kernel
compares the total number of tasks belonging to this real UID across the whole machine. One
process leaking threads can therefore starve every other process the same user owns.
Max locked memory is in bytes, and 8388608 is 8 MiB, the modern
default; older systems show 65536, a number with consequences we will get to.
Two smaller observations. "Max core file size 0" soft against unlimited hard is the standard
arrangement for core dumps: disabled by default, but any process may opt in by raising its
own soft limit, which is exactly the soft/hard design doing its job. And "unlimited" in this
file is a real value (RLIM_INFINITY), not an absence; for most resources it
means what it says, with one important exception for nofile that lives in the pitfalls.
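When you want one row of that file in a script rather than on screen, a small awk filter is enough. A minimal sketch, assuming the layout shown above (resource name first, then soft, then hard); the function name limit_of is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "SOFT HARD" for one named row of /proc/PID/limits.
# Usage: limit_of PID "Max open files"
limit_of() {
    local pid=$1 name=$2
    awk -v name="$name" '
        index($0, name) == 1 {
            rest = substr($0, length(name) + 1)   # text after the resource name
            split(rest, f, " ")                   # whitespace-separated fields
            print f[1], f[2]                      # soft, hard
        }' "/proc/$pid/limits"
}

# "self" resolves to the reading process, which inherited this shell's table.
limit_of self "Max open files"
limit_of self "Max processes"
```

The values come back exactly as the kernel prints them, so "unlimited" arrives as the literal word, not a number.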
When the limit is hit, this is what the failure looks like from underneath, courtesy of
strace. The error anatomy is worth recognising on sight:
    # the per-process fence: soft RLIMIT_NOFILE exhausted
    accept4(7, ..., SOCK_CLOEXEC) = -1 EMFILE (Too many open files)

    # the machine-wide pool: fs.file-max exhausted (rare on modern boxes)
    openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = -1 ENFILE (Too many open files in system)
    # dmesg says: VFS: file-max limit 1000000 reached

    # the user-wide task budget: RLIMIT_NPROC exhausted
    clone(...) = -1 EAGAIN (Resource temporarily unavailable)
Three different errors, three different fences, one shared symptom of "the service stopped
working." The English strings are close enough to confuse a tired engineer; the errno names
are not. EMFILE sends you to /proc/PID/limits and then to
lsof to find out what all those descriptors are.
ENFILE sends you to sysctl fs.file-max and a hunt for whichever
process is eating the global pool. EAGAIN from fork or
clone sends you counting tasks, not files.
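Before reaching for lsof, a quick count of a process's descriptors against its soft limit tells you how close to the fence it is. This is a sketch only: fd_headroom is a made-up name, and reading another user's /proc/PID/fd needs root, in which case the count comes back as 0.

```shell
#!/usr/bin/env bash
# Hypothetical helper: open descriptors versus the soft nofile limit of a PID.
fd_headroom() {
    local pid=$1 used soft
    # One entry per open descriptor; empty if we may not read the directory.
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # "Max open files <soft> <hard> files": the soft value is field 4.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open=$used soft=$soft"
}

# $$ is this shell; the count includes the pipe used by the command itself.
fd_headroom $$
```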
Three production scenarios
The proxy that drowned at 1024
A reverse proxy has run for months. Marketing buys a spot during something televised,
traffic triples, and the proxy starts refusing connections with EMFILE in its
logs. A proxy holds two descriptors per connection, one to the client and one upstream, plus
listeners, logs, and resolver sockets, so a four-digit connection count walks straight into
a 1024 fence. The interesting work is not raising the number, it is tracing where the number
came from, because the fix has to land at the layer that set it:
    $ grep 'open files' /proc/9131/limits
    Max open files            1024                 524288               files
    $ systemctl show proxy.service -p LimitNOFILE -p LimitNOFILESoft
    LimitNOFILE=524288
    LimitNOFILESoft=1024

    # stopgap, takes effect immediately, survives nothing:
    $ sudo prlimit --pid 9131 --nofile=65536:524288

    # real fix, survives restarts:
    $ sudo systemctl edit proxy.service
    # add (two-value form, so the hard limit keeps its 524288):
    #   [Service]
    #   LimitNOFILE=65536:524288
    $ sudo systemctl daemon-reload && sudo systemctl restart proxy.service
Walk the chain. Was the process started from a shell? No, so every ulimit in
every dotfile is irrelevant. Was it started by systemd? Yes, so its limits are exactly what
the unit gave it, and systemctl show confirms the soft value is the inherited
default, untouched by the unit file. Is it in a container? Then there is one more layer:
the runtime sets limits on the container's init process and everything inside inherits from
there, so the same investigation moves inside with
docker exec proxy cat /proc/1/limits, and the fix becomes
--ulimit nofile=65536:524288 on the run command or the daemon's
default-ulimits. Same fences, one more place they can be planted. The prlimit
stopgap is genuinely useful here: it unbreaks production in two seconds, on the live
process, while the unit-file change waits for a calm moment to restart.
fork: retry: Resource temporarily unavailable
A deploy user's cron jobs start failing, and anyone who tries sudo -u deploy bash
gets fork: retry: Resource temporarily unavailable. Nothing is out of memory.
This is RLIMIT_NPROC, and the detail that cracks the case is the one from the
annotated output above: the count is per user, across every process that UID owns,
and threads count as tasks. The usual culprit is not ten thousand processes; it is one
process with ten thousand leaked threads:
    $ ps -L --no-headers -u deploy | wc -l     # no -e: it would select every user's processes
    63988
    $ ps -o pid,nlwp,comm -u deploy --sort=-nlwp | head -3
        PID  NLWP COMMAND
      55012 63871 sync-worker    # one PID owns nearly all of them: a thread leak
      55400     9 gunicorn
    $ grep processes /proc/55012/limits
    Max processes             63990                63990                processes
63988 tasks against a fence of 63990: every new thread or fork by this user fails, including
the innocent cron jobs and your debugging shell, while the leaking process itself sits there
healthy because it is not trying to spawn anything. Restarting the leaker frees the budget
instantly; the durable fix is in its code. Two adjacent fences are worth knowing while you
are here. systemd also enforces TasksMax= per service through the cgroup pids
controller, which produces the same EAGAIN with a different ceiling
(systemctl show svc -p TasksMax reveals it), and it is not an rlimit at all. And
the per-user task accounting is one of the reasons running everything as one shared UID on a
box is a quiet act of self-sabotage: one tenant's leak takes out every tenant's fork.
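The per-user count can also be taken straight from /proc, which helps when the user is so far over budget that ps itself cannot fork under that UID. A rough sketch, assuming the usual /proc/PID/status layout where the Uid: line precedes Threads:; the function name tasks_for_uid is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total tasks (threads) owned by a real UID, summed from
# the Uid: and Threads: lines of every /proc/PID/status.
tasks_for_uid() {
    local uid=$1 total=0 status key value ruid threads
    for status in /proc/[0-9]*/status; do
        ruid="" threads=0
        while read -r key value _; do
            case $key in
                Uid:)     ruid=$value ;;      # first number is the real UID
                Threads:) threads=$value ;;
            esac
        done 2>/dev/null < "$status" || continue   # process exited meanwhile
        [ "$ruid" = "$uid" ] && total=$((total + threads))
    done
    echo "$total"
}

echo "tasks: $(tasks_for_uid "$(id -ru)")  soft nproc limit: $(ulimit -Su)"
```

Run it as root with the affected UID as the argument; the number to compare against is the "Max processes" row of any process that user owns.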
The 64-kilobyte fence nobody remembers planting
A perf engineer ships a tool built on io_uring, or an SRE runs an eBPF tracer,
and it dies on startup with ENOMEM or a permission error on a machine with two
hundred free gigabytes. The fence here is RLIMIT_MEMLOCK, the cap on memory a
process may lock so it can never be paged out. Ring buffers shared with the kernel have to
be locked, registered io_uring buffers count against it, and on kernels before
5.11 every eBPF map and program was charged to it too, which is why older hosts are where
this bites hardest. The historical default soft limit was 64 KiB, a number from an era
when locking memory was exotic, and 64 KiB disappears the moment a modern tool sets up
its first ring:
    $ grep 'locked memory' /proc/61204/limits
    Max locked memory         65536                65536                bytes
    # io_uring_queue_init: Cannot allocate memory
    # bpf(BPF_MAP_CREATE, ...) = -1 EPERM   (kernels before 5.11)

    $ sudo prlimit --pid 61204 --memlock=1073741824:1073741824   # live
    # or in the unit file: LimitMEMLOCK=1G (and restart)
Newer stacks have softened this from both ends: systemd raised its default to 8 MiB,
and kernel 5.11 moved BPF accounting to the cgroup memory controller, so the same tracer
that needed LimitMEMLOCK=infinity on an old host runs untouched on a new one.
That inconsistency is itself the lesson. The error a tool throws depends on the kernel, the
distro, and which init started it, so skip the folklore and read the process's own limits
file first. It takes four seconds and it ends arguments.
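One trap when comparing numbers for this particular limit: /proc reports locked memory in bytes, while the bash builtin ulimit -l reports it in 1024-byte units. The sketch below prints both views of the same soft limit; the two function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Soft RLIMIT_MEMLOCK as /proc reports it: bytes, or the word "unlimited".
memlock_soft_bytes() {
    awk '/^Max locked memory/ {print $4}' /proc/self/limits
}

# The same limit as the bash builtin reports it: KiB, or "unlimited".
memlock_soft_kib() {
    ulimit -Sl
}

echo "proc:   $(memlock_soft_bytes) bytes"
echo "ulimit: $(memlock_soft_kib) KiB"
```

So a shell that prints 64 and a limits file that prints 65536 are describing the same fence.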
What is underneath: inheritance, and the two roads from PID 1
All of the layer confusion dissolves once you see the mechanism. Resource limits are a small
table inside the kernel's per-process state, sixteen pairs of numbers, nothing more. When a
process calls fork(), the child gets a copy of the parent's table. When it
calls execve() to become a different program, the table survives untouched.
That is the entire propagation model: limits flow down the process tree by copying, at
creation time, and no process ever re-reads any configuration file afterwards. There is no
subscription, no refresh, no daemon watching limits.conf. A limit is wherever
it is because some ancestor set it before the fork, or because something with enough
privilege reached in later with prlimit(). The fork-and-exec machinery itself
is the subject of the processes page; limits are simply one more thing it copies.
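Both halves of that model, the copy at fork and the survival across execve(), fit in a few lines of bash. This is an illustrative sketch: soft_nofile_after_exec is a made-up name, and it assumes the hard nofile limit is at least 256.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lower the soft nofile limit in a forked subshell, exec a
# fresh bash over it, and print the limit the new program sees.
soft_nofile_after_exec() (
    ulimit -Sn "$1" || exit 1          # edits the subshell's copy only
    exec bash -c 'ulimit -Sn'          # execve(): new program, same table
)

echo "parent before: $(ulimit -Sn)"
echo "after exec:    $(soft_nofile_after_exec 256)"
echo "parent after:  $(ulimit -Sn)"
```

The middle line reports the lowered value even though a brand-new bash printed it, and the parent's number is the same before and after.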
Now the two roads. When systemd starts a service, PID 1 forks, sets the limits the unit
demands, and execs the daemon. That is the whole story for the service path: there is no
login, no PAM session, and therefore /etc/security/limits.conf is never opened.
That file is not system configuration in any general sense. It is the input to one specific
PAM module, pam_limits.so, which runs only when something opens a PAM session:
an SSH login, a console login, su -, a cron job on most distributions. So
limits.conf governs the limits of human sessions and the things humans launch,
while unit files govern daemons, and the two never touch. This is the answer to the
perennial mystery ticket: "I raised nofile in limits.conf, rebooted, and the service still
has 1024." Of course it does. The service's road never passes that file.
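To see which entry points on a given host actually run pam_limits.so, and therefore read limits.conf, you can search the PAM service files. A minimal sketch, assuming the conventional /etc/pam.d layout; the function name pam_limits_services is hypothetical, and services that pull the module in through an include of a shared file will not be listed by name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the PAM service files that name pam_limits.so on
# an uncommented line. The directory argument defaults to /etc/pam.d.
pam_limits_services() {
    local dir=${1:-/etc/pam.d}
    grep -lsE '^[^#]*pam_limits\.so' "$dir"/* | sed 's|.*/||'
}

pam_limits_services
```

Whatever the list contains on your host, a systemd-started service is not reached through any of these files, which is the whole point of the two roads.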
The same model explains the other classic. "I edited limits.conf, why does my running daemon
still show the old value?" would be unanswerable if limits were a live setting; it is
obvious once you know they are copied at fork. The daemon's table was written once, at
start, from whatever its parent had plus whatever its parent applied. Configuration edited
afterwards is just bytes on disk that nothing will read until the next start. The only ways
to change a running process's limits are from inside (the process calls
setrlimit() on itself) or from outside with prlimit(), which is
exactly what the prlimit tool wraps. Everything else is a restart wearing a
disguise.
Pitfalls
Raising the soft limit and ignoring the hard one, or the reverse. The pair
moves under different rules, and half-raised limits fail in confusing ways. Set
LimitNOFILE=65536 in a unit and systemd sets both values to 65536, which
silently lowers the hard limit from 524288 and takes headroom away from a process
that would have raised itself. Use the two-value form, LimitNOFILE=8192:524288,
when you mean different numbers. In a shell, plain ulimit -n 65536 sets both;
and if you lower the hard limit in a shell experiment, that shell cannot undo it, because
lowering a hard limit is the irreversible unprivileged move. New shell, fresh table.
Forgetting the container runtime is a fifth fence-setter. A containerised
process inherits from the runtime, not from your host shell and not from the service manager
inside the container image. Docker takes --ulimit nofile=4096:8192 per
container and default-ulimits in the daemon config, and the runtime's own
service unit feeds the chain above that. The cautionary tale is the containerd unit that
shipped LimitNOFILE=infinity for years: containers inherited a hard limit
around a billion, and programs that "close all possible descriptors before exec" by looping
from 0 to the hard limit, a common daemonising idiom, sat burning CPU for minutes. Limits
that are too high break things too, just more creatively.
Confusing fs.file-max with the per-process limit. They are different fences
around different fields. Raising fs.file-max when a process throws
EMFILE does nothing, because the process hit its own nofile fence, not the
machine's pool. The reverse mistake also happens: cranking per-process limits to the moon on
a box whose global pool is modest just moves the failure to a worse place, since
ENFILE hits every process at once instead of the greedy one.
Believing "unlimited" means unlimited for nofile. For most resources
RLIM_INFINITY is genuine. For open files it cannot be: the kernel requires the
hard nofile limit to stay at or below fs.nr_open, and a
setrlimit() asking for more is refused with EPERM, a confusing
errno for "the number was too big." systemd papers over this by translating
LimitNOFILE=infinity into the current fs.nr_open value, which is
tidy but means "infinity" quietly equals whatever fs.nr_open holds: about a million with the
kernel default, and as much as a billion on a host where the sysctl has been raised, as in the
containerd story above. If a service needs a known descriptor budget, write the
number.
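The ceiling is easy to observe. The sketch below asks for a hard nofile limit one above fs.nr_open inside a subshell and reports whether the request was accepted; try_hard_nofile is a made-up name. For an unprivileged shell the refusal may simply be the missing privilege, but the same request is refused for root as well, which is the fs.nr_open rule at work.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to set the hard nofile limit in a subshell and
# report success or failure through the exit status.
try_hard_nofile() (
    ulimit -Hn "$1" 2>/dev/null
)

nr_open=$(cat /proc/sys/fs/nr_open)
echo "fs.nr_open = $nr_open, current hard nofile = $(ulimit -Hn)"

if try_hard_nofile "$((nr_open + 1))"; then
    echo "unexpected: a hard limit above fs.nr_open was accepted"
else
    echo "a hard limit above fs.nr_open was refused"
fi
```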
Checking the shell when the patient is a daemon. Worth restating as a
pitfall because it is the single most common limits mistake: ulimit -n in your
SSH session describes your SSH session. The daemon took the other road. Check
/proc/PID/limits, always, and treat any limits claim that is not backed by that
file as a rumour.
A drill you can run right now
Everything below is safe on any Linux machine, shared ones included: it reads state, lowers a limit only inside throwaway child processes, and restores everything by letting those children exit. Ten minutes, and soft versus hard, the two roads, and the live-change trick stop being theory.
Step 1 — read your own table, twice. Run ulimit -a and read
every line; this is the soft column. Run ulimit -Hn and compare it to
ulimit -n; on most distros you will find the 1024-against-a-much-bigger-number
split this page keeps returning to. Then get the same answer from the kernel with
cat /proc/$$/limits (the shell expands $$ to its own PID) and
confirm the builtin and the file agree. They must: they are two views of one table.
Step 2 — compare your shell to a daemon. Pick any running service and put
the two roads side by side: systemctl show sshd -p MainPID -p LimitNOFILESoft,
then cat /proc/THATPID/limits (root may be needed to read another user's
process). Notice what matches and what does not against your own shell's table: the daemon's
numbers came from a unit file and PID 1, yours came through PAM and
limits.conf, and nothing forces them to agree.
Step 3 — hit the fence on purpose. Lower the soft nofile limit inside a child shell and watch an ordinary command die, while your own shell stays untouched:
    $ bash -c 'ulimit -n 3; cat /etc/hostname'
    cat: /etc/hostname: Too many open files
    $ ulimit -n
    1024
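Sit with why that failed. A nofile limit of 3 permits descriptors 0, 1, and 2, which the
child shell already holds as stdin, stdout, and stderr. cat needs descriptor 3
to open the file, the kernel refuses with EMFILE, and you have just reproduced
the production error with a one-line command. On many glibc systems the failure arrives one
step earlier and reads "error while loading shared libraries ... Error 24", because the dynamic
loader needs descriptor 3 to open libc before cat's own code runs; errno 24 is EMFILE, so it is
the same fence. The second command shows the other half of the
lesson: the parent shell still reads 1024, because the child got a copy of the
table and its changes died with it. Inheritance is copying; copying is one-way.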
Step 4 — change a live process with prlimit. Start a victim, squeeze it, inspect it, release it:
    $ sleep 600 &
    [1] 70233
    $ prlimit --pid 70233 --nofile=64:1024
    $ grep 'open files' /proc/70233/limits
    Max open files            64                   1024                 files
    $ prlimit --pid 70233 --nofile=1024:1024
    $ kill %1
No restart, no signal, no cooperation from the process: the table changed while
sleep ran, and /proc showed the change instantly. You could do
this unprivileged because the process is yours and the values stayed at or under the hard
limit. During a real incident the same two commands, with sudo and bigger numbers, are the
difference between "fixed in ten seconds" and "restarted the only healthy instance."
cat /proc/PID/limits is the truth for
any running process; prlimit --pid PID --nofile=S:H changes it live; and systemd
services take their limits from the unit file, never from limits.conf.
Further reading
- getrlimit(2) / setrlimit(2) / prlimit(2) — the system-call view: every RLIMIT_* resource, the soft/hard rules, and the exact errno for each refusal.
- systemd.exec(5) — the Limit*= directives, the soft:hard syntax, and the documented translation of "infinity" for nofile.
- Lennart Poettering — File descriptor limits — the reasoning behind the 1024 soft / 524288 hard default, from the person who set it.
- proc(5) — documentation for /proc/PID/limits and the rest of the per-process files this page leans on.