What's holding this port?

The deploy fails. The log says bind: address already in use and the new release refuses to start. This error has exactly two shapes: either something is listening on the port, or the old process is gone but its sockets are not. The whole investigation fits in a handful of commands, and the hard part is never running them — it is reading the answers and deciding what is safe to do next. This page walks the investigation the way you would at 3am: one command, one decision, repeat until the port is yours.

Two shapes of the same error

Here is the failure as it usually arrives — a service that started, tried to bind, and died:

$ journalctl -u app -n 3 --no-pager
Jun 08 14:02:11 web-3 app[52110]: starting on 0.0.0.0:8080
Jun 08 14:02:11 web-3 app[52110]: Error: listen EADDRINUSE: address already in use 0.0.0.0:8080
Jun 08 14:02:11 web-3 systemd[1]: app.service: Main process exited, code=exited, status=1/FAILURE

EADDRINUSE comes from the kernel's bind() call, and the kernel refuses for one of two reasons. Either another socket is already in the LISTEN state on that address and port — a live process holds it — or no process holds it at all, but sockets from a previous life of the service are still draining through TCP's shutdown states and the kernel will not hand the address out again yet. The first shape has a culprit with a PID. The second shape has no culprit; killing things will not help, because there is nothing to kill.

Everything that follows is about telling those two shapes apart fast, and then making the right move for whichever one you have. The shortcut version, if you have seen all this before, is three commands.

The fast version. sudo ss -tlnp 'sport = :8080' — is anything listening, and what is its PID? If a listener shows up: ps -fp PID and decide whether it should die and how. If nothing is listening: ss -tan 'sport = :8080' — TIME-WAIT rows mean you need SO_REUSEADDR in the app or sixty seconds of patience, not a kill.

[Figure: the whole investigation as a decision tree. Each branch corresponds to one section of this page; the leaf you land on tells you which fix applies.]
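The first split in that tree, listener versus TIME-WAIT versus neither, can be sketched as a small shell function. This is a minimal sketch, not tooling from this page: the name classify_port_state is hypothetical, and it only inspects the State column of ss -tan rows fed to it on standard input.

```shell
# Hypothetical helper: classify a port from `ss -tan` rows on stdin.
# Prints one of: listener, time-wait, neither.
classify_port_state() {
  awk '
    $1 == "LISTEN"    { listen++ }
    $1 == "TIME-WAIT" { tw++ }
    END {
      if (listen > 0)  print "listener"    # first shape: a live holder with a PID
      else if (tw > 0) print "time-wait"   # second shape: nothing to kill
      else             print "neither"     # check the machine and the namespace
    }'
}
```

A typical call would be ss -tan 'sport = :8080' | classify_port_state. Without -l, ss reports listeners and dying connections together, so one command feeds both branches.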

Step one: ask ss who is listening

Start with ss, because it asks the kernel's socket tables directly and answers in milliseconds even on a loaded box. You want listening TCP sockets, numeric output, with the owning process, filtered to the one port you care about:

$ sudo ss -tlnp 'sport = :8080'
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port   Process
LISTEN   0        4096     0.0.0.0:8080         0.0.0.0:*           users:(("java",pid=38104,fd=89))

Each letter of -tlnp earns its place: -t for TCP, -l for sockets in the listening state only, -n for raw numbers instead of slow name lookups, -p for the process column on the right. The quoted 'sport = :8080' filter narrows the answer to sockets whose local (source) port is 8080. Quote it out of habit: this particular filter would survive unquoted, but filter operators such as ( ) < and > mean things to your shell. If you can never remember the filter syntax, sudo ss -tlnp | grep ':8080 ' gets you the same rows with less elegance (a bare grep 8080 would also match PID 8080 or port 18080), and during an incident nobody is grading elegance.

Run it with sudo. Without root, ss -p can only name processes you own; sockets held by other accounts still show up as rows, but with an empty process column, which is exactly the kind of half-answer that sends people down wrong paths.
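When the PID has to feed another command, the process column can be parsed rather than copied by hand. The sketch below assumes the users:(("name",pid=N,fd=M)) layout shown above; the function name listener_pids is hypothetical.

```shell
# Hypothetical helper: print the unique PIDs found in ss process columns on stdin.
listener_pids() {
  grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}
# Example (needs root to see other users' processes):
#   sudo ss -tlnpH 'sport = :8080' | listener_pids
```

If ss prints LISTEN rows but this prints nothing, the process column was empty, which is the missing-sudo symptom described above.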

Now read what came back. One row means one listener, and the process column hands you everything: the command name, the PID, and the file descriptor holding the socket. Before doing anything else, classify the listener, because each kind has a different right move:

What the process column says	What it usually means	The move
Your own service, an older PID	The previous release never died, or an orphaned worker survived the restart	Identify how it was started, stop it that way — see below
Your own service, a PID seconds old	A supervisor already restarted it; you are racing systemd or docker	Stop fighting the supervisor; stop the unit, not the PID
A different service entirely	A port collision: two things configured onto 8080	Nobody needs killing; one of them needs a config change
`docker-proxy` or `kube-proxy`	The port is published from a container or a Service	The fix lives in the container layer — see the proxy section
Empty process column (with sudo)	Rare: a kernel-space holder or a socket in another namespace	Cross-check with lsof, then suspect containers

One subtlety about the address column before moving on. A listener on 0.0.0.0:8080 (or *:8080 for IPv6) claims the port on every interface, so anything else binding 8080 anywhere will collide with it. But a listener on 127.0.0.1:8080 only claims the loopback side — a new process binding 10.0.4.12:8080 would succeed alongside it. When the row's local address is specific and your bind is the wildcard, the wildcard loses. That is occasionally the entire incident: someone bound a debug instance to localhost, and the real service wants the wildcard. The mechanics of addresses, binds, and what a listening socket actually is are in sockets.
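The overlap rule can be written down as a tiny predicate. This is a deliberately simplified model and not the kernel's full logic: it assumes IPv4 addresses only and no SO_REUSEADDR or SO_REUSEPORT in play, and the name binds_collide is hypothetical.

```shell
# Simplified model (assumption: IPv4 only, no SO_REUSEADDR/SO_REUSEPORT).
# Exit 0 when two ADDR:PORT binds would collide, 1 otherwise.
binds_collide() {
  local a_addr="${1%:*}" a_port="${1##*:}"
  local b_addr="${2%:*}" b_port="${2##*:}"
  [ "$a_port" = "$b_port" ] || return 1        # different ports never collide
  [ "$a_addr" = "$b_addr" ] && return 0        # same address, same port
  # otherwise they collide only if one side is the wildcard
  [ "$a_addr" = "0.0.0.0" ] || [ "$b_addr" = "0.0.0.0" ]
}
```

With the addresses from the paragraph above, binds_collide 0.0.0.0:8080 127.0.0.1:8080 succeeds (a collision), while binds_collide 127.0.0.1:8080 10.0.4.12:8080 fails (the two can coexist).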

Step two: cross-check with lsof

If ss answered cleanly, you can skip ahead. But there are two cases where a second opinion from lsof earns its keep. The first is the empty process column: ss shows a LISTEN row but cannot say who owns it, usually because you forgot sudo, occasionally because the socket lives somewhere strange. The second is plain paranoia before a destructive action — you are about to kill something, and two tools agreeing is cheap insurance.

$ sudo lsof -nP -iTCP:8080 -sTCP:LISTEN
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    38104 deploy   89u  IPv4 798112      0t0  TCP  *:8080 (LISTEN)

Same answer, different angle: lsof walks /proc per process rather than reading the socket tables, and it adds the USER the process runs as next to the same descriptor number ss reported, both useful when you write the incident notes later. The -sTCP:LISTEN filter matters; without it you also get every established connection touching 8080, and right now those are noise. The flags, the column decoder, and why -nP is non-negotiable are all on the lsof page.

If both tools agree something is listening, jump to the "is it safe to kill" section. If both agree nothing is listening and the bind still fails, you have the second shape of the error, and it is the more interesting one.
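The "two tools agreeing" check can be made mechanical before a destructive action. The sketch below compares two PID lists as sets; the name pids_agree is hypothetical, and the commented wiring assumes lsof's -t option, which prints bare PIDs.

```shell
# Hypothetical helper: succeed when two newline-separated PID lists hold the same set.
pids_agree() {
  [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}
# Example wiring (both commands need root to see other users' processes):
#   ss_pids=$(sudo ss -tlnpH 'sport = :8080' | grep -o 'pid=[0-9]*' | cut -d= -f2)
#   lsof_pids=$(sudo lsof -t -nP -iTCP:8080 -sTCP:LISTEN)
#   pids_agree "$ss_pids" "$lsof_pids" || echo "tools disagree: stop and look again" >&2
```

Two empty lists also count as agreement, which matches the "both agree nothing is listening" branch.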

Nothing is listening, and bind still fails

This is the shape that confuses people, because the mental model "a process owns the port" has run out. No process owns it. The kernel does. Drop the -l flag and look at all TCP sockets on that source port, not just listeners:

$ ss -tan 'sport = :8080'
State       Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
TIME-WAIT   0        0        10.0.4.12:8080       10.0.9.55:49210
TIME-WAIT   0        0        10.0.4.12:8080       10.0.9.61:50112
TIME-WAIT   0        0        10.0.4.12:8080       10.0.9.55:49377

There they are: connections from the old instance, finished but not yet forgotten. When a TCP connection closes, the side that sent the first FIN — the active closer — must hold the connection's address pair in a state called TIME-WAIT for long enough that any packets still wandering the network can arrive and die harmlessly, instead of being mistaken for data on a brand-new connection that reused the same addresses. On Linux that hold is sixty seconds, a constant compiled into the kernel. (People often try to tune it with tcp_fin_timeout; that sysctl controls a different state, FIN-WAIT-2, and does not touch TIME-WAIT at all.) The full state machine, and why the wait is two times the maximum segment lifetime, is in TCP.

The active closer pays the TIME-WAIT tax. A restarting server closed dozens of connections first, so dozens of its port-8080 address pairs sit here for a minute.
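On a busy box the ss -tan listing can run to hundreds of rows, and a tally by state is easier to read than the raw list. A minimal sketch, assuming ss output with the state in the first column; the name count_by_state is hypothetical.

```shell
# Hypothetical helper: tally `ss -tan` rows on stdin by TCP state, largest count first.
count_by_state() {
  awk '$1 != "State" && NF { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}
# Example:
#   ss -tan 'sport = :8080' | count_by_state
```

A tally dominated by TIME-WAIT with no LISTEN line is the second shape of the error at a glance.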

So why does this block a fresh bind()? By default, the kernel refuses to bind a listening socket to a local port while any socket — even a dying one — still occupies that port. The escape hatch is the socket option SO_REUSEADDR, set before the bind, which tells the kernel: I accept the (tiny, mostly theoretical) risk, let me bind even though TIME-WAIT remnants exist. It is worth being precise about what this option does and does not do, because folklore has inflated it:

Claim	True?
`SO_REUSEADDR` lets you bind while old connections sit in TIME-WAIT	Yes. This is its actual job, and why every serious server sets it.
It lets you bind over a port someone else is LISTENING on	No. A live listener still wins; you still get EADDRINUSE.
It lets two processes listen on the same port at once	No. That is `SO_REUSEPORT`, a different option with a different purpose — see the prevention section.
It is dangerous and lets old data leak into new connections	Mostly no. TCP's sequence-number checks make confusion vanishingly unlikely; the option is standard practice in essentially all server software.

This re-frames the fix. If your investigation lands here — no listener, TIME-WAIT rows present — there is nothing to kill, and anything you do kill is collateral damage. The honest fixes are: wait sixty seconds (fine for a one-off, miserable as a habit), or fix the application so it sets SO_REUSEADDR before binding. Most frameworks do this for you; the ones that bite are hand-rolled servers and older runtimes where the option is off unless asked for. If your service hits this on every restart, the bug is in its socket setup, and the deploy script that "fixes" it with sleep 90 is a confession, not a cure.

One more case lives on this branch of the tree: no listener and no TIME-WAIT rows, yet the bind still fails. Before doubting the kernel, doubt your vantage point. Are you on the right machine? Is the application running inside a container with its own network namespace, so that the conflict exists in a namespace your host-level ss cannot see? ss run on the host shows the host's namespace only; for a container's view you need nsenter -t PID -n ss -tlnp or an exec into the container. The error message is generated wherever the bind ran. Look there.
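A quick way to test the vantage-point theory is to compare network namespaces before reaching for nsenter. The sketch below compares the /proc/PID/ns/net links of the current shell and a target process; the name same_netns is hypothetical, and reading another user's link generally needs root.

```shell
# Hypothetical helper: exit 0 if PID $1 shares this shell's network namespace,
# 1 if it lives in a different one, 2 if either link cannot be read.
same_netns() {
  local mine theirs
  mine=$(readlink /proc/self/ns/net) || return 2
  theirs=$(readlink "/proc/$1/ns/net") || return 2
  [ "$mine" = "$theirs" ]
}
# Example: if the app's PID is in another namespace, look there instead.
#   same_netns 52110 || sudo nsenter -t 52110 -n ss -tlnp
```

The PID 52110 is the app PID from the journal excerpt earlier on this page; a process that has already exited leaves nothing to inspect, so this check only works while the container or service is still running.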

You found the holder. Is it safe to kill?

Back to the first shape: a live listener with a PID. The reflex is kill -9 and redeploy, and the reflex is wrong often enough to spend ninety seconds checking. Three questions, in order: what is this process, who started it, and what is the polite way to stop it?

$ ps -o pid,ppid,user,lstart,cmd -p 38104
  PID   PPID USER   STARTED                       CMD
38104      1 deploy Tue Jun  2 09:14:33 2026     java -jar /opt/app/app-2.4.1.jar

Read this line slowly, because every field is a clue. The command string with the version number tells you whether this is the previous release (expected, safe to retire) or something else entirely (stop and ask around). The start time tells you whether this process predates the deploy, which makes it an old instance, or appeared two seconds ago, which means a supervisor is restarting it and you are about to play whack-a-mole. And a parent PID of 1 has two readings: either the init system started the process directly, or its original parent died and it was orphaned and re-parented to init, which is exactly what happens when a worker survives its parent's death and keeps the inherited listening socket alive.
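The start-time judgement can be reduced to one comparison if you read the process age in seconds instead of a calendar date. This is a sketch under assumptions: the name holder_age_verdict is hypothetical, and the commented example relies on the etimes (elapsed seconds) output field of the procps version of ps.

```shell
# Hypothetical helper.
# $1 = age of the holder in seconds, $2 = seconds since the deploy began.
holder_age_verdict() {
  if [ "$1" -ge "$2" ]; then
    echo "old instance: predates the deploy"
  else
    echo "respawned: started after the deploy began, suspect a supervisor"
  fi
}
# Example, for a deploy that began five minutes ago:
#   holder_age_verdict "$(ps -o etimes= -p 38104 | tr -d ' ')" 300
```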

Then find out who is responsible for it, because the right way to stop a process is to ask its manager, not the process itself:

$ systemctl status 38104
● app.service - Payments API
     Loaded: loaded (/etc/systemd/system/app.service; enabled)
     Active: active (running) since Tue 2026-06-02 09:14:33 UTC
   Main PID: 38104 (java)
     CGroup: /system.slice/app.service
             └─38104 java -jar /opt/app/app-2.4.1.jar

systemctl status accepts a bare PID and tells you which unit, if any, owns it. Three outcomes, three moves. If it belongs to a systemd unit: systemctl stop app.service — never kill, because systemd will either restart what you killed or mark the unit failed in a way that confuses the next deploy. If the cgroup line says docker or the command is containerd-shim-adjacent: docker stop CONTAINER, which delivers a TERM, waits, then escalates — the same etiquette you would follow by hand. And if it belongs to nothing — no unit, no container, started from someone's terminal three weeks ago — then and only then is a manual signal the right tool: kill 38104 sends TERM, the process gets a chance to drain connections and exit cleanly, and you escalate to kill -9 only after TERM has had a few seconds and visibly failed. The difference between those signals, and why 9 should be the last resort instead of the reflex, is the whole subject of kill & signals.
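For the last case only, a process that belongs to no unit and no container, the TERM-then-escalate etiquette can be sketched as a function. The name term_then_kill and the ten-second default are assumptions, not a recommendation from this page, and the function treats any failure of the first kill as "already gone", which is also what a permissions error would look like.

```shell
# Sketch: send TERM, wait up to $2 seconds (default 10), then fall back to KILL.
# Returns 0 if the process went away after TERM, 1 if KILL was needed.
term_then_kill() {
  local pid="$1" grace="${2:-10}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0    # assumption: failure means already gone
  while [ "$waited" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0     # no such process: TERM worked
    sleep 1
    waited=$((waited + 1))
  done
  kill -KILL "$pid" 2>/dev/null                # last resort, after TERM visibly failed
  return 1
}
```

Called as sudo-level term_then_kill 38104 5, a return of 1 is worth a line in the incident notes: the process ignored TERM, and that is a finding in its own right.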

After the stop, verify before you redeploy. The polite shutdown you just triggered closes connections, and closing connections creates, you guessed it, TIME-WAIT entries. The listener row should vanish from ss -tlnp within a second or two; if your app lacks SO_REUSEADDR, the TIME-WAIT remnants may still block the new bind for up to a minute. Knowing both shapes of the error means this does not surprise you twice in one incident.

The supervisor trap: you kill it and it comes back

A classic mid-incident detour. You found the listener, killed it, ran ss again — and there it is, same command, new PID, listening smugly on 8080. You did not fail to kill it. Something is paid to bring it back.

$ sudo ss -tlnp 'sport = :8080'
LISTEN  0  4096  0.0.0.0:8080  0.0.0.0:*  users:(("java",pid=39882,fd=89))
# a moment ago this was pid 38104. someone restarted it.

The restarter is almost always one of three things: a systemd unit with Restart=always or Restart=on-failure, a container runtime with a restart policy, or a process-level supervisor (supervisord, pm2, a shell script in a while loop that someone wrote in 2019 and everyone forgot). The fastest way to find out which is the cgroup, because on a systemd machine every process's cgroup path spells out its ancestry:

$ cat /proc/39882/cgroup
0::/system.slice/app.service
# or, for a container:
0::/system.slice/docker-9f81c2e4a77b…scope

A system.slice/something.service path means systemd owns it; stop the unit, and if you need it to stay stopped while you work, remember that systemctl stop already suppresses the restart logic — Restart= policies fire on process death, not on deliberate stops. A docker-…scope path means a container; docker ps plus docker stop from there, and if the container has --restart=always, docker stop likewise wins where kill loses, because stopping through the engine records intent while a raw kill looks like a crash to be repaired. And if the cgroup is just your user session, the supervisor is a script or a tmux pane somewhere: ps -o ppid= -p PID and walk up parents until you find the loop.
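The cgroup-to-manager mapping described above can be sketched as a case statement. It covers only the path shapes shown on this page (a single cgroup v2 line with a .service, docker-….scope, or session-….scope leaf); other layouts, such as cgroup v1 or Kubernetes pods, fall through to "unknown". The name owner_from_cgroup is hypothetical.

```shell
# Hypothetical helper: map one cgroup v2 line (as in /proc/PID/cgroup) to its likely manager.
owner_from_cgroup() {
  local path="${1#*::}"        # drop the "0::" prefix
  local leaf="${path##*/}"     # last path component
  local id
  case "$leaf" in
    docker-*.scope)  id="${leaf#docker-}"; echo "container ${id%.scope}" ;;  # use docker stop
    *.service)       echo "systemd $leaf" ;;                                 # use systemctl stop
    session-*.scope) echo "session" ;;                                       # walk up the parents
    *)               echo "unknown" ;;
  esac
}
# Example:
#   owner_from_cgroup "$(cat /proc/39882/cgroup)"
```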

The general law underneath the trap: kill processes through their managers. A supervisor's whole purpose is to disbelieve in spontaneous process death. Going around it does not defeat it; it recruits it against you.

When the holder is a proxy: containers and Kubernetes

Sometimes the process column answers with a name that is technically correct and practically useless:

$ sudo ss -tlnp 'sport = :8080'
LISTEN  0  4096  0.0.0.0:8080  0.0.0.0:*  users:(("docker-proxy",pid=2219,fd=4))

docker-proxy is not your application. When you publish a container port with -p 8080:80, Docker sets up address translation in the kernel's netfilter tables and also starts this small userland helper, which holds the host port for the cases the kernel rules cannot cover (traffic from the host itself to its own published ports, mostly). Killing it changes nothing durable and breaks loopback access to the container in the meantime. The actual owner of the port is whichever container published it, and the question routes there:

$ docker ps --filter publish=8080
CONTAINER ID   IMAGE          PORTS                    NAMES
9f81c2e4a77b   app:2.3.9      0.0.0.0:8080->80/tcp     app-old
$ docker stop app-old

Kubernetes adds one more layer of indirection in the same spirit. On a node, a Service's NodePort traffic is usually handled by netfilter rules that kube-proxy programs, so ss may show kube-proxy holding the port (it binds NodePorts to reserve them and to fail fast on conflicts) or, in some configurations, show nothing listening at all while the port still answers — the redirect happens in the kernel before any socket would. Either way the lesson is identical: a proxy's name in the process column is a forwarding address, not a culprit. Find the workload behind it — docker ps, kubectl get svc -A | grep 8080 — and resolve the conflict at that layer: stop the container, change the published port, or pick a different NodePort. Reaching for kill at the proxy layer is treating the receptionist as the CEO.

A worked example, end to end

Here is the investigation run once, whole, the way it actually plays out. The setting: a deploy of app-2.4.2 to a box called web-3 fails with EADDRINUSE on 8080.

$ sudo ss -tlnp 'sport = :8080'
LISTEN  0  4096  0.0.0.0:8080  0.0.0.0:*  users:(("java",pid=31550,fd=89))

# a listener. who is pid 31550?
$ ps -o pid,ppid,user,lstart,cmd -p 31550
  PID   PPID USER   STARTED                   CMD
31550      1 deploy Mon Jun  1 22:40:09 2026  java -jar /opt/app/app-2.4.0.jar

# version 2.4.0 — two releases old, parent is pid 1. orphan? unit?
$ systemctl status 31550
Failed to get unit for PID 31550: PID 31550 does not belong to any loaded unit.
$ cat /proc/31550/cgroup
0::/user.slice/user-1004.slice/session-91.scope

# a login session, not a service: someone started 2.4.0 by hand a week ago
# and it has squatted on the port through two deploys. TERM it politely:
$ sudo kill 31550
$ sudo ss -tlnp 'sport = :8080'
# (no output — the listener is gone)
$ ss -tan 'sport = :8080' | head -3
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
TIME-WAIT  0      0      10.0.4.12:8080      10.0.9.55:49210
TIME-WAIT  0      0      10.0.4.12:8080      10.0.9.61:50112

# its dying connections went TIME-WAIT, as expected. our app sets
# SO_REUSEADDR, so this won't block the bind. deploy:
$ sudo systemctl start app && sleep 2 && curl -s localhost:8080/healthz
ok

Notice what the walkthrough never needed: kill -9, a reboot, or a guess. Each command's output chose the next command. The single judgement call — is the 2.4.0 process safe to terminate — was answered by evidence (wrong version, no unit, a human's login session) rather than hope. And the final ss -tan check meant the TIME-WAIT rows were an expected footnote instead of a second mystery.

Endings, and how to stop meeting this error

Every run of this investigation ends in one of a small set of places. The old instance was never stopped: fix the deploy script so stopping the old release is a step with its own verification, not a fire-and-forget. Two services were configured onto one port: move one, and consider making port assignments a reviewed file instead of tribal knowledge. The app lacked SO_REUSEADDR: set it; this is a one-line fix that deletes a whole class of restart pain. A human's hand-started process was squatting: that one is culture, not code.

Two pieces of prevention are worth building once and keeping. First, make the deploy script wait for the port to actually be free instead of binding into a race:

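# after stopping the old release, before starting the new one:
free=no
for i in $(seq 1 30); do
  # -H drops the header, so "no output" cleanly means "port free"
  if ! ss -tlnH 'sport = :8080' | grep -q .; then free=yes; break; fi
  sleep 1
done
# do not start the new release into a port that is still held
[ "$free" = yes ] || { echo "port 8080 still held after 30s" >&2; exit 1; }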

Thirty seconds of patience, checked once a second, ends the race between the old listener's shutdown and the new bind. The same loop inverted — wait until the port answers — belongs after the start, as the health check that gates the deploy's success. Start, wait for listen, curl the health endpoint, only then call it done.
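Both directions of that wait can share one retry helper. The sketch below is one possible shape, not the deploy script from this page: wait_until, port_free, and port_listening are hypothetical names, and the health URL in the comments reuses the /healthz endpoint from the worked example.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or $1 seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout_s="$1" i=0
  shift
  while [ "$i" -lt "$timeout_s" ]; do
    "$@" && return 0
    sleep 1
    i=$((i + 1))
  done
  return 1
}

port_free()      { ! ss -tlnH "sport = :$1" | grep -q .; }
port_listening() {   ss -tlnH "sport = :$1" | grep -q .; }

# Example deploy gate:
#   wait_until 30 port_free 8080      || { echo "port still held" >&2; exit 1; }
#   sudo systemctl start app
#   wait_until 30 port_listening 8080 || { echo "never started listening" >&2; exit 1; }
#   curl -fs localhost:8080/healthz   || { echo "health check failed" >&2; exit 1; }
```

Returning a failure on timeout is the design choice that matters here: a wait that silently gives up just moves the race thirty seconds later.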

Second, for services that cannot afford the gap at all, SO_REUSEPORT exists: it lets several sockets listen on the same port simultaneously, with the kernel spreading incoming connections across them. The zero-downtime pattern is to start the new release alongside the old one — both listening, both serving — then tell the old one to stop accepting and drain. No gap, no race, no EADDRINUSE, because the port is never released at all. It costs real engineering (both versions serve traffic at once, so they must be compatible), which is why it is a deliberate choice for the services that need it rather than a default. The option's mechanics live in sockets.

Incident notes worth keeping. When you write this one up, capture the evidence, not just the outcome: the exact ss line showing the holder (command, PID, fd), the ps line with the start time and full command, what systemctl status PID or the cgroup said about ownership, the action you took and the signal you used, and whether TIME-WAIT showed up afterwards. Five lines of pasted terminal output turns "the port was stuck so I killed some stuff" into a record the next engineer can learn the method from — and it makes the pattern visible when the same squatter shows up three deploys in a row.
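Capturing that evidence is easier if each command and its output land in the notes together. A minimal sketch: log_cmd is a hypothetical helper and incident-notes.txt a hypothetical file name. It prints the arguments joined by spaces, so the quoting of the ss filter is not preserved in the log.

```shell
# Hypothetical helper: echo a command, run it, keep its output and any failure status.
log_cmd() {
  printf '$ %s\n' "$*"
  "$@" 2>&1 || printf '(exit %s)\n' "$?"
  echo
}
# Example, using the port and PID from this page:
#   {
#     log_cmd sudo ss -tlnp 'sport = :8080'
#     log_cmd ps -o pid,ppid,user,lstart,cmd -p 38104
#     log_cmd systemctl status 38104
#     log_cmd cat /proc/38104/cgroup
#   } >> incident-notes.txt
```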
