
too many open files

The process hit its file-descriptor limit. Since every socket, pipe, and epoll instance is an fd, this is usually a connection leak or a default 1024 limit on a busy server — rarely about actual files.


The symptom

Any syscall that creates a file descriptor starts failing: accept on servers (each new connection needs an fd), open, socket, pipe. Servers often log it in a retry loop, which is why one saturated process can emit thousands of these lines a minute.

http: Accept error: accept tcp [::]:8080: accept4: too many open files; retrying in 5ms
EMFILE: too many open files, open "/var/app/data.json"      (Node.js)
OSError: [Errno 24] Too many open files                     (Python)
  ← EMFILE = this process’s limit. The rarer ENFILE means the whole SYSTEM
    hit fs.file-max; dmesg then shows "VFS: file-max limit <N> reached"

The diagnosis

1 Read the limit the process actually has

$ cat /proc/4112/limits | grep "open files"
Max open files            1024                 524288               files
                          ↑ soft (what you hit)  ↑ hard (raisable to)
  ← read the PROCESS’s limits, not your shell’s: ulimit -n in your terminal
    says nothing about what systemd or the container runtime gave the daemon

The soft limit is what EMFILE enforces; the hard limit is how far the process can raise it without privilege. A 1024 soft limit on a server that holds thousands of connections is the obvious suspect, but confirm usage in step 2 before concluding the limit is the problem.
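The same read can be scripted when you need it in a health check or a debugging tool. This is a minimal sketch, assuming a Linux /proc and the column layout shown above; parse_open_files_limit and read_open_files_limit are hypothetical helper names, not part of any library.

```python
from typing import Optional, Tuple


def _to_int(value: str) -> Optional[int]:
    # /proc prints the word "unlimited" when no limit applies.
    return None if value == "unlimited" else int(value)


def parse_open_files_limit(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    None stands for "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: "Max", "open", "files", soft, hard, units.
            soft, hard = line.split()[3:5]
            return _to_int(soft), _to_int(hard)
    raise ValueError("no 'Max open files' row found")


def read_open_files_limit(pid: int) -> Tuple[Optional[int], Optional[int]]:
    # Reads the target process's limits, not those of this interpreter.
    with open(f"/proc/{pid}/limits") as f:
        return parse_open_files_limit(f.read())
```

Calling read_open_files_limit(4112) on the process above would be expected to return the soft and hard values from the "Max open files" row.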

2 Count the open fds, and watch the trend

$ ls /proc/4112/fd | wc -l
1023        ← parked at the soft limit: confirmed
$ watch -n5 "ls /proc/4112/fd | wc -l"
  ← growing under STEADY load   → leak (cause 1)
  ← plateaued, scales with load → under-provisioned limit (cause 2)

The trend is the diagnosis. A leak grows without bound at constant traffic and resets on restart; the limit divided by the time between forced restarts approximates your leak rate. Legitimate usage tracks concurrency and flattens. This one distinction decides whether the fix is code or configuration.
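To put a number on the trend instead of eyeballing watch output, the count can be sampled and reduced to a growth rate. This is a rough sketch, assuming Linux /proc; count_fds, sample, and growth_per_minute are hypothetical names, and the 5-second interval simply mirrors the watch command above.

```python
import os
import time
from typing import List, Tuple


def count_fds(pid: int) -> int:
    # Each entry in /proc/<pid>/fd is one open descriptor (Linux only).
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample(pid: int, interval_s: float = 5.0, rounds: int = 12) -> List[Tuple[float, int]]:
    """Collect (unix_time_seconds, fd_count) pairs at a fixed interval."""
    samples = []
    for _ in range(rounds):
        samples.append((time.time(), count_fds(pid)))
        time.sleep(interval_s)
    return samples


def growth_per_minute(samples: List[Tuple[float, int]]) -> float:
    """Average fd growth per minute between the first and last sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples taken at different times")
    return (n1 - n0) * 60.0 / (t1 - t0)
```

A clearly positive rate at constant traffic points to cause 1; a rate near zero at a count that follows concurrency points to cause 2. The remaining headroom divided by the rate gives a rough time to the next EMFILE.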

3 See what the descriptors are

$ lsof -p 4112 | awk "{print \$5}" | sort | uniq -c | sort -rn | head
   1843 IPv4      ← overwhelmingly sockets: a connection-shaped problem
     31 REG       ← regular files: not the issue here
     12 FIFO
$ ss -tnp state close-wait | grep "pid=4112" | wc -l
1781        ← the smoking gun: CLOSE_WAIT means the peer hung up and
              this process never called close() on its side

Sockets dominating points at connections; regular files dominating points at file handling (and check lsof for "deleted" entries — the disk-full investigation’s cousin). A pile of CLOSE_WAIT is conclusive: the remote side closed, the kernel is waiting for your process to close its half, and it never does. That is a code-level leak — an unclosed response body, a forgotten connection in an error path — with a file and line number waiting to be found.
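Where lsof is not installed (minimal containers, for example), a similar breakdown can be read from the symlinks in /proc/<pid>/fd. The sketch below assumes the usual Linux link-target formats (socket:[inode], pipe:[inode], anon_inode:[eventpoll] and so on); classify, tally, and fd_targets are hypothetical helper names.

```python
import os
from collections import Counter
from typing import Iterable, Iterator


def classify(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        # e.g. anon_inode:[eventpoll], anon_inode:[timerfd], anon_inode:inotify
        return target[len("anon_inode:"):].strip("[]")
    if target.endswith(" (deleted)"):
        return "deleted file"
    return "path"  # regular files, directories, devices


def tally(targets: Iterable[str]) -> Counter:
    return Counter(classify(t) for t in targets)


def fd_targets(pid: int) -> Iterator[str]:
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            yield os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir and readlink
```

tally(fd_targets(4112)).most_common() then gives the same kind of ranking as the lsof pipeline. It cannot tell CLOSE_WAIT from ESTABLISHED, so the ss check is still needed for that.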

The causes, ranked

  1. A descriptor leak: usually sockets on error paths

    Confirm: fd count grows under steady load; CLOSE_WAIT sockets pile up; restart resets the count and the clock.

  2. The default 1024 soft limit on a legitimately busy server

    Confirm: fd count plateaus near the limit and tracks concurrency; no CLOSE_WAIT accumulation; the fds are healthy ESTABLISHED sockets.

  3. fd-hungry patterns: a watcher or epoll per task

    Confirm: lsof shows large counts of anon_inode (epoll, timerfd, eventfd) or inotify entries rather than sockets.

  4. ENFILE: the system-wide table is full

    Confirm: "VFS: file-max limit <N> reached" in dmesg; cat /proc/sys/fs/file-nr shows allocations at the ceiling. Errors hit many processes at once.

The fixes

A descriptor leak: usually sockets on error paths

Close in the paths that don’t run on sunny days: error returns, early exits, timeouts. In Go, every http.Response body must be closed even when you don’t read it; in most languages the cure is defer/finally/with around anything that opens. Cap idle pool sizes so even correct code can’t hoard.
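As an illustration of closing on every exit path, here is a small Python sketch. A local socket pair stands in for a real outbound connection, and exchange is a hypothetical function invented for the example; the point is only that the with block releases both descriptors whether the body returns or raises.

```python
import socket


def exchange(a: socket.socket, b: socket.socket, payload: bytes, fail: bool = False) -> bytes:
    """Take ownership of both sockets and close them on every exit path."""
    with a, b:  # closed on normal return AND when an exception propagates
        a.sendall(payload)
        if fail:
            # Simulated error path: without the with block, a and b would
            # stay open for as long as anything still references them.
            raise TimeoutError("simulated error path")
        return b.recv(len(payload))


if __name__ == "__main__":
    a, b = socket.socketpair()
    try:
        exchange(a, b, b"ping", fail=True)
    except TimeoutError:
        pass
    print(a.fileno(), b.fileno())  # -1 -1: both descriptors were released
```

The same shape applies to files, HTTP responses, and database connections: acquire, then immediately hand the resource to with (or try/finally), before any code that can fail.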

The default 1024 soft limit on a legitimately busy server

Raise it where the process actually launches: for systemd services, LimitNOFILE=65536 in a drop-in (systemctl edit, then daemon-reload and restart); for containers, the runtime’s ulimit settings. Shell ulimit only affects children of that shell — it is the right fix almost nowhere in production.
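The unit file or container runtime remains the place to set the limit. As a complement, and not something the steps above require, a process can lift its own soft limit up to the hard limit at startup, since that needs no privilege. A minimal sketch using Python's standard resource module; raise_soft_nofile is a hypothetical helper name and 65536 is just the figure used above.

```python
import resource


def raise_soft_nofile(target: int) -> int:
    """Raise this process's soft RLIMIT_NOFILE toward target.

    The request is capped at the hard limit and never lowers the current
    soft limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # An unlimited soft limit needs no raising; otherwise only move upward.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft limit now:", raise_soft_nofile(65536))
```

This only helps when the hard limit is already high enough; if the hard limit itself is 1024, the change has to come from whatever launches the process.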

fd-hungry patterns: a watcher or epoll per task

Share the machinery: one watcher with many registrations instead of one per file, one event loop per pool rather than per job. For inotify specifically there are separate per-user sysctls (fs.inotify.max_user_watches, fs.inotify.max_user_instances) that can run out before the fd limit does.

ENFILE: the system-wide table is full

Find the hog first — sum fds per process from /proc/*/fd — because one leaking process can starve the box. Raise fs.file-max only if the aggregate demand is genuinely legitimate.
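Summing fds per process can be sketched in a few lines. This assumes Linux /proc and enough privilege (typically root) to read other users' fd directories; processes that cannot be inspected are skipped silently. fd_counts is a hypothetical helper, and the proc_root parameter exists only to make it easy to try against a test directory.

```python
import os
from typing import List, Tuple


def fd_counts(proc_root: str = "/proc") -> List[Tuple[int, int]]:
    """Return (fd_count, pid) pairs, largest first, for every readable process."""
    counts = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "sys" or "meminfo"
        try:
            n = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        counts.append((n, int(entry)))
    return sorted(counts, reverse=True)


if __name__ == "__main__":
    for n, pid in fd_counts()[:10]:
        print(f"{n:8d}  pid {pid}")
```

If one pid dominates the top of that list, treat it as a leak in that process (cause 1) before touching fs.file-max.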

What people get wrong

  • ulimit -n in your shell changes nothing for the daemon. Limits are inherited at process creation. A systemd service reads LimitNOFILE from its unit; a container gets the runtime’s defaults. Checking /proc/<pid>/limits for the real process — not running ulimit in a fresh SSH session — is the difference between fixing it and believing you did.
  • Raising the limit on a leak reschedules the incident. A leak consumes any limit; doubling it doubles the time to the next page, at a busier hour. The CLOSE_WAIT check in step 3 tells you which case you have before you spend the config change.
  • "We don’t open files" rules nothing out. Sockets, pipes, epoll instances, eventfds, timerfds — everything is a descriptor (getrlimit(2) RLIMIT_NOFILE counts them all). A web service with no file I/O at all can exhaust its limit on connections alone.

Quick answers

What’s the quick fix for "too many open files"?

First check whether it’s a leak: count fds (ls /proc/<pid>/fd | wc -l) and check for CLOSE_WAIT sockets. If the count grows under steady load, fix the leak — raising limits only delays the recurrence. If usage is legitimate, raise the soft limit where the process launches: LimitNOFILE for systemd units, runtime ulimits for containers.

Why do I have thousands of sockets in CLOSE_WAIT?

CLOSE_WAIT means the remote side closed the connection and your process has not called close() on its descriptor. The kernel will wait indefinitely. A socket sits in CLOSE_WAIT briefly whenever the peer closes first, but thousands that persist are a local code bug, typically an unclosed response body or a connection abandoned on an error path, not a network problem.

How do I permanently raise the limit for a systemd service?

systemctl edit <service>, add [Service] LimitNOFILE=65536, then systemctl daemon-reload and restart the service. Verify with cat /proc/<new-pid>/limits. Editing /etc/security/limits.conf affects PAM login sessions, not systemd services — a classic mis-fix.
