too many open files
The process hit its file-descriptor limit. Since every socket, pipe, and epoll instance is an fd, this is usually a connection leak or a default 1024 limit on a busy server — rarely about actual files.
The symptom
Any syscall that creates a file descriptor starts failing: accept on servers (each new connection needs an fd), open, socket, pipe. Servers often log it in a retry loop, which is why one saturated process can emit thousands of these lines a minute.
http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
EMFILE: too many open files, open "/var/app/data.json" (Node.js)
OSError: [Errno 24] Too many open files (Python)
← EMFILE = this process’s limit. The rarer ENFILE means the whole SYSTEM
hit fs.file-max; dmesg then shows "VFS: file-max limit reached"
The diagnosis
1 Read the limit the process actually has
$ cat /proc/4112/limits | grep "open files"
Max open files 1024 524288 files
↑ soft (what you hit) ↑ hard (raisable to)
← read the PROCESS’s limits, not your shell’s: ulimit -n in your terminal
says nothing about what systemd or the container runtime gave the daemon
The soft limit is what EMFILE enforces; the hard limit is how far it can be raised without privilege. A 1024 soft limit on a server that holds thousands of connections is the likely culprit before any further debugging, but confirm usage in step 2 before concluding the limit is the problem.
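The same check can be scripted. The following is a minimal Python sketch, assuming Linux and the usual column layout of /proc/<pid>/limits; the helper names parse_open_files_limit and read_open_files_limit are hypothetical, not part of any library.

```python
def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    Each value is an int, or the string "unlimited" if the kernel reports that.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row looks like: "Max open files   1024   524288   files"
            soft, hard = line.split()[3:5]
            return tuple(v if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_open_files_limit(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_open_files_limit(f.read())
```

Calling read_open_files_limit(4112) would return the same two numbers as the cat command above, for the daemon rather than for your shell.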
2 Count the open fds, and watch the trend
$ ls /proc/4112/fd | wc -l
1023 ← parked at the soft limit: confirmed
$ watch -n5 "ls /proc/4112/fd | wc -l"
← growing under STEADY load → leak (cause 1)
← plateaued, scales with load → under-provisioned limit (cause 2)
The trend is the diagnosis. A leak grows without bound at constant traffic and resets on restart, so the soft limit divided by the time between forced restarts approximates your leak rate. Legitimate usage tracks concurrency and flattens. This one distinction decides whether the fix is code or configuration.
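To make the leak-versus-plateau distinction concrete, here is a rough Python sketch that classifies a series of fd counts sampled at a fixed interval under steady load. The function name classify_fd_trend and its thresholds (5% of the soft limit, 75% of steps rising) are illustrative assumptions, not established cutoffs.

```python
def classify_fd_trend(samples, soft_limit, tolerance=0.05):
    """Classify fd-count samples as "leak", "plateau", or "unclear".

    samples: fd counts taken at a fixed interval while traffic is steady.
    tolerance: growth (as a fraction of soft_limit) still treated as flat.
    """
    if len(samples) < 4:
        return "unclear"  # too few points to call a trend
    half = len(samples) // 2
    first_avg = sum(samples[:half]) / half
    second_avg = sum(samples[half:]) / (len(samples) - half)
    growth = (second_avg - first_avg) / soft_limit
    rising_steps = sum(b > a for a, b in zip(samples, samples[1:]))
    if growth > tolerance and rising_steps >= 0.75 * (len(samples) - 1):
        return "leak"
    if abs(growth) <= tolerance:
        return "plateau"
    return "unclear"
```

A result of "plateau" only means the count is flat; whether it is flat near the limit still has to be read off the numbers.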
3 See what the descriptors are
$ lsof -p 4112 | awk '{print $5}' | sort | uniq -c | sort -rn | head
1843 IPv4 ← overwhelmingly sockets: a connection-shaped problem
31 REG ← regular files: not the issue here
12 FIFO
$ ss -tnp state close-wait | grep "pid=4112" | wc -l
1781 ← the smoking gun: CLOSE_WAIT means the peer hung up and
this process never called close() on its side
Sockets dominating points at connections; regular files dominating points at file handling (also check lsof for "deleted" entries, the disk-full investigation’s cousin). A pile of CLOSE_WAIT is conclusive: the remote side closed, the kernel is waiting for your process to close its half, and it never does. That is a code-level leak (an unclosed response body, a forgotten connection in an error path) with a file and line number waiting to be found.
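Where lsof is not installed, a similar tally can be read straight from the symlinks in /proc/<pid>/fd. This is a minimal Python sketch assuming Linux; the helper names classify_fd_target and tally_fds and the coarse category labels are hypothetical.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        # Keep the full label, e.g. "anon_inode:[eventpoll]" or "anon_inode:inotify"
        return target
    if target.endswith(" (deleted)"):
        return "deleted path"
    return "path"  # regular files, directories, devices


def tally_fds(pid):
    """Count the open descriptors of a process by kind."""
    counts = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
    return counts
```

tally_fds(4112).most_common() would give a ranking comparable to the lsof pipeline, without the socket family breakdown.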
The causes, ranked
- 1 A descriptor leak, usually sockets on error paths
  Confirm: fd count grows under steady load; CLOSE_WAIT sockets pile up; restart resets the count and the clock.
- 2 The default 1024 soft limit on a legitimately busy server
  Confirm: fd count plateaus near the limit and tracks concurrency; no CLOSE_WAIT accumulation; the fds are healthy ESTABLISHED sockets.
- 3 fd-hungry patterns: a watcher or epoll per task
  Confirm: lsof shows large counts of anon_inode (epoll, timerfd, eventfd) or inotify entries rather than sockets.
- 4 ENFILE: the system-wide table is full
  Confirm: "VFS: file-max limit reached" in dmesg; cat /proc/sys/fs/file-nr shows allocations at the ceiling. Errors hit many processes at once.
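For the ENFILE check, /proc/sys/fs/file-nr holds three numbers: allocated file handles, allocated-but-unused handles, and the system maximum. A small Python sketch of reading it follows; the function names are hypothetical, and what counts as "at the ceiling" is left to the reader.

```python
def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return allocated, unused, maximum


def system_fd_utilization(text):
    """Fraction of the system-wide file table that is currently allocated."""
    allocated, _unused, maximum = parse_file_nr(text)
    return allocated / maximum
```

On a live Linux box, system_fd_utilization(open("/proc/sys/fs/file-nr").read()) approaching 1.0 would be consistent with cause 4.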
The fixes
For a leak (cause 1), close descriptors in the paths that don’t run on sunny days: error returns, early exits, timeouts. In Go, every http.Response body must be closed even when you don’t read it; in most languages the cure is defer/finally/with around anything that opens. Cap idle pool sizes so even correct code can’t hoard.
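As a sketch of the "with around anything that opens" pattern in Python: the function below is a made-up example (read_greeting and its HELLO protocol are hypothetical), but the mechanism is standard, since socket objects are context managers and are closed when the with-block exits, whether by return or by exception.

```python
import socket


def read_greeting(conn, expected=b"HELLO"):
    """Read a fixed greeting from conn; conn is closed on every exit path."""
    with conn:  # leaving the block closes the socket, even on an exception
        data = conn.recv(len(expected))
        if data != expected:
            # Error path: without the with-block, this raise would leak conn.
            raise ValueError(f"unexpected greeting: {data!r}")
        return data
```

The version that leaks is the same function without the with line: the happy path might close conn later, while the raise leaves it open until garbage collection, if ever.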
For an under-provisioned limit (cause 2), raise it where the process actually launches: for systemd services, LimitNOFILE=65536 in a drop-in (systemctl edit, then daemon-reload and restart); for containers, the runtime’s ulimit settings. Shell ulimit only affects that shell and its children, so it is the right fix almost nowhere in production.
For fd-hungry patterns (cause 3), share the machinery: one watcher with many registrations instead of one per file, one event loop per pool rather than per job. For inotify specifically there are separate sysctls (such as fs.inotify.max_user_watches and fs.inotify.max_user_instances) that can run out before the fd limit does.
For ENFILE (cause 4), find the hog first by summing fds per process from /proc/*/fd, because one leaking process can starve the box. Raise fs.file-max only if the aggregate demand is genuinely legitimate.
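Summing fds per process can be sketched in a few lines of Python. This assumes Linux; the function names are hypothetical, and processes whose fd directory is unreadable (no permission, or already exited) are silently skipped, so run it as root for a complete picture.

```python
import os


def fd_counts(proc_root="/proc"):
    """Return {pid: open fd count} for every process whose fd dir is readable."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited, or we lack permission
    return counts


def top_fd_holders(proc_root="/proc", n=5):
    """Return the n (pid, count) pairs with the most open descriptors."""
    counts = fd_counts(proc_root)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The proc_root parameter exists only so the functions can be pointed at a test directory; in normal use the default applies.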
What people get wrong
- ulimit -n in your shell changes nothing for the daemon. Limits are inherited at process creation. A systemd service reads LimitNOFILE from its unit; a container gets the runtime’s defaults. Checking /proc/<pid>/limits for the real process — not running ulimit in a fresh SSH session — is the difference between fixing it and believing you did.
- Raising the limit on a leak reschedules the incident. A leak consumes any limit; doubling it doubles the time to the next page, at a busier hour. The CLOSE_WAIT check in step 3 tells you which case you have before you spend the config change.
- "We don’t open files" rules nothing out. Sockets, pipes, epoll instances, eventfds, timerfds — everything is a descriptor (getrlimit(2) RLIMIT_NOFILE counts them all). A web service with no file I/O at all can exhaust its limit on connections alone.
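The last point is easy to demonstrate. The Python sketch below lowers its own soft limit and then opens nothing but socket pairs until the kernel refuses; the function name and the limit of 64 are arbitrary choices for illustration, and it assumes a Unix-like system where the resource module is available.

```python
import errno
import resource
import socket


def exhaust_with_sockets(soft_limit=64):
    """Lower this process's soft RLIMIT_NOFILE, then open socket pairs until
    the kernel refuses. Returns (errno, number_of_sockets_opened)."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_soft != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    try:
        while True:
            held.extend(socket.socketpair())  # two descriptors, no file on disk
    except OSError as exc:
        return exc.errno, len(held)
    finally:
        # Release everything and restore the original soft limit.
        for sock in held:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
```

The returned errno should be errno.EMFILE, the same error a file open would produce, even though no file was ever touched.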
Quick answers
What’s the quick fix for "too many open files"?
First check whether it’s a leak: count fds (ls /proc/<pid>/fd | wc -l) and check for CLOSE_WAIT sockets. If the count grows under steady load, fix the leak — raising limits only delays the recurrence. If usage is legitimate, raise the soft limit where the process launches: LimitNOFILE for systemd units, runtime ulimits for containers.
Why do I have thousands of sockets in CLOSE_WAIT?
CLOSE_WAIT means the remote side closed the connection and your process has not called close() on its descriptor. The kernel will wait indefinitely. Every socket passes through CLOSE_WAIT briefly during a normal close; thousands that persist are a local code bug, typically an unclosed response body or a connection abandoned on an error path, not a network problem.
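If ss is unavailable, the count can be approximated from procfs. The Python sketch below assumes Linux, where /proc/<pid>/net/tcp and tcp6 list sockets with a hex state code (08 is CLOSE_WAIT) and an inode that matches the socket:[inode] targets in /proc/<pid>/fd. The function names are hypothetical.

```python
import os

CLOSE_WAIT = "08"  # TCP state code as printed in /proc/net/tcp


def close_wait_inodes(proc_net_tcp_text):
    """Return the inodes of CLOSE_WAIT sockets from /proc/net/tcp(6) text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # fields[3] is the state column, fields[9] the socket inode
        if len(fields) > 9 and fields[3] == CLOSE_WAIT:
            inodes.add(fields[9])
    return inodes


def socket_inodes_of(pid):
    """Return the socket inodes held open by a process."""
    inodes = set()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor closed while we were looking
        if target.startswith("socket:["):
            inodes.add(target[len("socket:["):-1])
    return inodes


def count_close_wait(pid):
    """Count CLOSE_WAIT TCP sockets owned by pid (its own network namespace)."""
    stuck = set()
    for table in ("tcp", "tcp6"):
        try:
            with open(f"/proc/{pid}/net/{table}") as f:
                stuck |= close_wait_inodes(f.read())
        except OSError:
            continue  # table missing (e.g. IPv6 disabled) or not readable
    return len(stuck & socket_inodes_of(pid))
```

A count that climbs between two calls a few minutes apart, at steady traffic, is the scripted equivalent of the ss check in step 3.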
How do I permanently raise the limit for a systemd service?
systemctl edit <service>, add [Service] LimitNOFILE=65536, then systemctl daemon-reload and restart the service. Verify with cat /proc/<new-pid>/limits. Editing /etc/security/limits.conf affects PAM login sessions, not systemd services — a classic mis-fix.
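For reference, the drop-in that systemctl edit writes would look roughly like the fragment below. The unit name myapp.service is a hypothetical placeholder, and 65536 is simply the value used in this article, not a recommendation for every workload.

```ini
# /etc/systemd/system/myapp.service.d/override.conf
# (path assumes a unit called myapp.service; substitute your own)
[Service]
LimitNOFILE=65536
```

After the restart, the "Max open files" row in /proc/<new-pid>/limits should show the new value; if it still shows 1024, the drop-in was not applied to the unit that actually starts the process.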