kill & signals
A deploy needs the old release to exit cleanly. nginx needs to pick up new config without
dropping a single connection. A runaway batch job needs to stop right now. Three different
requests, one mechanism underneath: signals, the only asynchronous message bus the kernel
gives you for talking to a running process. This page covers the eight signals worth knowing,
why kill -9 is the last word and not the first, how to decode exit code 137,
what kill can never fix, and a drill that makes all of it muscle memory.
The question it answers
How do you make a running process stop, reload, or talk to you, without breaking anything?
You cannot call into it; it is busy running its own code. You cannot write to its memory.
What the kernel gives you instead is a signal: a small integer, sent asynchronously, that
interrupts whatever the target is doing and either runs a function the process registered
in advance or applies a default action the kernel chose decades ago. That is the entire
interface. No payload to speak of, no reply, no acknowledgement. It sounds too thin to be
useful, and yet every clean shutdown, every Ctrl-C, every nginx -s reload, and
every Kubernetes pod termination you have ever seen runs through it.
The kill command is badly named, and the name does real damage. It does not
kill; it sends a signal, any signal, and only some of those signals mean "die."
kill -HUP 1290 asks nginx to reload its config. kill -USR1 1290
asks it to reopen its log files. kill -STOP 41327 freezes a process mid-flight
and kill -CONT 41327 thaws it. A bare kill 41327 sends SIGTERM,
which is a request to exit, one the process is allowed to handle, delay, or even refuse.
Only kill -9 is an order, and the difference between a request and an order is
the difference between a service that flushes its buffers, finishes its in-flight work, and
exits 0, and a service that simply ceases to exist between one instruction and the next.
In practice signals split into four families. Lifecycle signals end things: TERM, INT, QUIT, and KILL, in increasing order of violence. Convention signals carry an agreed-upon meaning that the kernel does not enforce: HUP has come to mean "reload your configuration" for daemons. Application-defined signals, USR1 and USR2, mean whatever the program's author decided they mean, which is why you read the man page before sending one. And job-control signals, STOP and CONT, pause and resume execution without ending anything, which is the mechanism behind Ctrl-Z in your shell and behind more than one creative production workaround. Different families, one mental model: a number arrives, and the process's registered disposition for that number decides what happens.
It also helps to know what signals are not. They are not messages: a standard signal carries no data beyond its own number, so you cannot send "reload, but only the TLS section." They are not queued: if the same standard signal is sent five times before the target gets a chance to act, it is delivered once, because pending signals are a bitmask, not a list. And they are not guaranteed to be acted on promptly, or at all: a process blocked in certain kernel waits will not see your signal until the kernel lets it, which is the seed of the most confusing failure mode this page covers. For anything richer, pipes, sockets, and shared memory exist, and they are the subject of interprocess communication. Signals are the doorbell, not the conversation.
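The no-queue rule is easy to watch from a shell. The sketch below is an illustration only: it freezes a throwaway subshell with STOP, sends USR1 five times while nothing can be delivered, then thaws it and counts how many times the handler ran. The names `count_file`, `target`, and `delivered`, and the short settling sleeps, are assumptions chosen for the demo.

```shell
#!/usr/bin/env bash
# Demo: five sends of a standard signal to a frozen process collapse into one.
count_file=$(mktemp)

# Throwaway target: appends one line each time its USR1 handler actually runs.
(
  trap 'echo handled >> "$count_file"' USR1
  while true; do sleep 0.1; done
) &
target=$!

sleep 0.5                       # let the subshell install its trap
kill -STOP "$target"            # freeze it: nothing can be delivered now
sleep 0.2                       # let the stop take effect before sending more
for _ in 1 2 3 4 5; do
  kill -USR1 "$target"          # each send sets the same pending bit
done
kill -CONT "$target"            # thaw: the pending USR1 is acted on
sleep 0.5                       # give the handler time to run

delivered=$(( $(wc -l < "$count_file") ))
echo "sent 5, handled $delivered"

kill "$target" 2>/dev/null      # clean up the throwaway process
wait "$target" 2>/dev/null || true
rm -f "$count_file"
```

Expect the handled count to be 1, not 5: the four extra sends found the bit already set and changed nothing.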
The signals that matter
Linux defines more than sixty signals. Eight of them cover essentially all the operational work you will ever do, and two of those eight are special in a way that the table flags and the rest of the page keeps coming back to.
| Signal | No. | Default action | What it is for |
|---|---|---|---|
| SIGTERM | 15 | Terminate | The polite default. Handlers run, cleanup happens, buffers flush. What kill sends when you name nothing. |
| SIGKILL | 9 | Terminate | Uncatchable, unblockable. No handler runs, no cleanup, no flushing. The last resort, never the first move. |
| SIGHUP | 1 | Terminate | Originally "the terminal hung up." For daemons, the near-universal convention for "reload your config." |
| SIGINT | 2 | Terminate | What Ctrl-C sends to the foreground process group. Interactive interruption. |
| SIGQUIT | 3 | Terminate + core dump | Ctrl-\ at the terminal. Like INT, but leaves a core file behind for the autopsy. |
| SIGUSR1 / SIGUSR2 | 10 / 12 | Terminate | App-defined. nginx reopens logs on USR1; dd prints progress; meaning varies, so check first. |
| SIGSTOP | 19 | Stop | The pause button. Uncatchable, like KILL. The process freezes wherever it is. |
| SIGCONT | 18 | Continue | Resumes a stopped process exactly where it froze. |
TERM is the workhorse, and it is worth being precise about what makes it polite. When a process receives SIGTERM, the kernel gives the process's own code a chance to run first. A well-written server has registered a handler that stops accepting new work, lets in-flight requests finish, flushes whatever it was buffering, releases its locks, tells its peers goodbye, and then exits with status 0. None of that is automatic; the handler has to exist and has to be correct. But the opportunity is the point. TERM says "please wrap up," and a process that respects it leaves the world consistent behind it.
KILL removes the opportunity. Signal 9 is one of exactly two signals a process cannot
catch, block, or ignore, and the kernel does not so much deliver it as act on it: the
process is destroyed without ever running another instruction of its own code. Think
through what that costs. Buffered writes that never reached disk are gone. A half-written
file stays half-written. Lock files and PID files stay on disk, lying about a process that
no longer exists. Database clients vanish mid-transaction and leave the server to time them
out. Child processes are orphaned and reparented rather than shut down. Shared memory
segments keep whatever inconsistent state they held at the instant of death. Every one of
those is a cleanup that a TERM handler would have done and that nothing will do now. That
is why kill -9 belongs at the end of an escalation, after TERM was sent and
given honest time to work, and not in anyone's muscle memory as the opener.
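The escalation described above fits in one small function. This is a minimal sketch, not a standard tool: `stop_gracefully` is a hypothetical name, the ten-second default is an assumption taken from the advice on this page, and it polls with `kill -0`, which only tests whether the PID can still be signalled.

```shell
#!/usr/bin/env bash
# stop_gracefully PID [TIMEOUT_SECONDS]  (hypothetical helper)
# Sends TERM, waits up to TIMEOUT seconds, and only then escalates to KILL.
# Returns 0 if the process exited on TERM, 1 if KILL was needed.
stop_gracefully() {
  local pid=$1 timeout=${2:-10} waited=0

  kill -TERM "$pid" 2>/dev/null || return 0    # nothing to signal: already gone

  while kill -0 "$pid" 2>/dev/null; do         # still signallable, so still there
    if [ "$waited" -ge "$timeout" ]; then
      echo "pid $pid still present after ${timeout}s of TERM, sending KILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Called as `stop_gracefully 41327 30`, it gives the process thirty seconds to clean up. The nonzero return on escalation is a deliberate choice: a runbook or script can log every case where TERM was not enough, which is usually worth investigating. Note that a PID stuck in D state or left as a zombie will still answer `kill -0`, so the function will escalate without effect in those cases.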
HUP is the interesting historical case. On real terminals it meant the line had literally
hung up, and the default action, terminate, made sense: your shell died, so your jobs
should too. (nohup exists precisely to shield a command from that fate.)
Daemons have no terminal, so the signal was sitting there unused, and a convention grew up
around it: HUP means "re-read your configuration." nginx, PostgreSQL, sshd, rsyslog, and
most daemons you can name honor it. The convention is social, not technical; the kernel has
no idea what "reload" means, and a daemon with no HUP handler will simply die, because the
default action never went away. Which is why you check before you send: HUP a process that
handles it and you get a graceful reload, HUP one that does not and you have just terminated
it.
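On Linux, one way to check before you send is to read the target's caught-signal mask from /proc. The sketch below assumes a Linux /proc filesystem, where the SigCgt line of /proc/PID/status is a hexadecimal bitmask with bit N-1 set when signal N has a handler; `catches_signal` is a hypothetical helper name, not an existing command.

```shell
#!/usr/bin/env bash
# catches_signal PID SIGNUM  (hypothetical helper, Linux /proc only)
# Returns 0 if the process has a handler installed for SIGNUM,
# 1 if it does not, 2 if the process status cannot be read.
catches_signal() {
  local pid=$1 signum=$2 key value hex=""
  [ -r "/proc/$pid/status" ] || return 2

  # Find the line that looks like "SigCgt: 0000000000014a07".
  while read -r key value _; do
    [ "$key" = "SigCgt:" ] && hex=$value
  done < "/proc/$pid/status"
  [ -n "$hex" ] || return 2

  # Bit (SIGNUM - 1) of the mask is set when SIGNUM is caught.
  [ $(( (0x$hex >> (signum - 1)) & 1 )) -eq 1 ]
}
```

A guarded reload then reads `catches_signal 1290 1 && kill -HUP 1290`. A handler existing says nothing about what it does, so the daemon's documentation still has the final word on whether HUP means "reload."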
STOP and CONT are the pair most engineers forget exist. STOP is the second uncatchable signal: the process freezes wherever it happens to be, mid-write, mid-request, holding every lock and socket it held, and stays frozen until CONT arrives. Nothing is lost and nothing is cleaned up, because nothing ends. This is occasionally exactly what you want: pausing a backup that is saturating disk I/O until peak traffic passes, or freezing a misbehaving process so it stops doing damage while you attach a debugger and read its state at leisure. The freeze has side effects, though. Health checks time out, peers see a connection that has gone silent, and watchdogs may decide the process is dead and act on that belief. Pause with intent.
Send kill PID, then wait ten seconds or more for anything with real state. Check with
ps -p PID. Still there? Look at why (a slow handler? D state?) before reaching for
kill -9 PID. The order matters because TERM can clean up and KILL cannot.
Reading the output
Three pieces of output do most of the teaching here: the signal list itself, what a graceful shutdown looks like next to an ungraceful one, and the exit-code arithmetic that lets you read a process's cause of death off a single number.
    $ kill -l
     1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
     6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
    11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
    16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
    …and 44 more, mostly real-time signals you will never send by hand
kill -l is the built-in cheat sheet, and the numbers matter because they show
up again in exit codes. Note where three of them sit: 9 and 19 are the uncatchable
pair, 15 is the default. The names work everywhere the numbers do, and they read better in
a runbook: kill -TERM 41327 and kill -15 41327 and plain
kill 41327 are the same request.
Here is the difference between TERM and KILL as it actually appears, in the logs of a service that handles TERM properly. First, the graceful path:
    $ kill 41327
    # meanwhile, in the service's log:
    12:04:31.118 INFO signal: SIGTERM received, beginning shutdown
    12:04:31.119 INFO listener: closed *:8080, no new connections
    12:04:31.120 INFO drain: waiting on 14 in-flight requests
    12:04:33.846 INFO drain: complete, all requests finished
    12:04:33.901 INFO db: connection pool closed cleanly
    12:04:33.902 INFO shutdown complete, exit 0
Just under three seconds of orderly wind-down: the listener closes first so nothing new
arrives, fourteen requests in flight get to finish, the database pool says a proper goodbye,
and the exit status is 0, which tells the supervisor this was an intended exit. Now the
same service, taken out with -9:
    $ kill -9 41327
    # meanwhile, in the service's log:
    12:09:02.412 INFO request: GET /api/orders/8841 200 (34ms)
    12:09:02.487 INFO request: POST /api/orders 201 (8█
    # nothing. the log ends mid-line. that is the whole story.
No farewell, no flush, not even the end of the line it was writing. The fourteen in-flight
requests of the previous example would have died with their connections; their clients see
a reset and retry, or do not. The truncated log line is itself a diagnostic: when you find
one at the end of a dead service's log, the process did not exit, it was extinguished,
either by a human with -9 or by the kernel's out-of-memory killer, which uses
the same signal.
Which brings us to the arithmetic. When a process is terminated by a signal, the shell (and most supervisors, and Kubernetes) report its exit status as 128 plus the signal number. This one fact lets you read cause of death off a status code:
    $ ./run-batch.sh; echo $?
    137
    # decode: 137 − 128 = 9 → SIGKILL. someone, or something, sent -9.
    #
    # the table you end up memorising:
    #   130 = 128 + 2  → SIGINT  (Ctrl-C)
    #   137 = 128 + 9  → SIGKILL (kill -9, the OOM killer, or K8s after grace)
    #   139 = 128 + 11 → SIGSEGV (segfault)
    #   143 = 128 + 15 → SIGTERM (a clean kill the process did not catch)
137 is the one worth tattooing somewhere, because it is the tell for two of the most common
production deaths that nobody admits to. When the kernel runs out of memory it picks a
victim and sends SIGKILL, so a container that exits 137 with OOMKilled in its
status was executed by the kernel for its memory usage. And when Kubernetes terminates a
pod that did not exit within its grace period, the final blow is also SIGKILL, so 137 with
no OOM marker often means "the app ignored SIGTERM and ran out the clock." Same number, two
different incidents, and the next section walks the second one properly. A 143, by
contrast, is almost reassuring: the process died to TERM without a handler, which is blunt
but at least immediate.
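The 128 + N arithmetic is mechanical enough to script. The helper below is a sketch under one assumption worth stating: it treats any status from 129 to 192 as a signal death, although a program can also choose to exit with such a number on its own. `decode_status` is a hypothetical name; `kill -l N` is the shell built-in that maps a signal number to its name.

```shell
#!/usr/bin/env bash
# decode_status STATUS  (hypothetical helper)
# Explains a shell-style exit status using the 128 + N rule.
decode_status() {
  local status=$1 sig name
  if [ "$status" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
    sig=$((status - 128))                        # the 128 + N rule, reversed
    name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
    echo "killed by SIG${name} (signal ${sig})"
  else
    echo "exited on its own with status ${status}"
  fi
}

decode_status 137
decode_status 143
```

The decoder names the signal, not the sender. For a 137 you still have to check for an OOM marker or an expired grace period to learn who sent it.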
Three production scenarios
The pod that had thirty seconds to live
Kubernetes does not have a special way to stop your application. It has signals, used on a
timer. When a pod is deleted (for a rollout, a node drain, an eviction), two things happen
in parallel: the pod is removed from its Service endpoints so new traffic stops arriving,
and the kubelet runs the preStop hook, if you defined one, then sends SIGTERM to PID 1 of
each container. A clock has been running since termination began:
terminationGracePeriodSeconds, thirty seconds by default, and time spent in preStop counts
against it. If your process exits before the clock runs out, the pod ends cleanly. If it
does not, the kubelet sends SIGKILL, and your container exits 137 with whatever state it
happened to have.
Read that timeline as a contract. Kubernetes promises to ask nicely first and to give you a
fixed window; your application promises to treat SIGTERM as "drain and exit," and to do it
inside the window. An application that ignores TERM does not get a different lifecycle, it
gets the same lifecycle with the worst ending: thirty wasted seconds, then a SIGKILL with
requests mid-flight. Every deploy becomes a small outage. The fixes are unglamorous: handle
TERM in your server framework (most have a shutdown hook waiting to be wired up), set
terminationGracePeriodSeconds longer than your slowest legitimate drain plus any preStop delay, and
remember that endpoint removal is not instantaneous, so a short preStop sleep
before the drain begins covers the gap where traffic is still arriving. The full sequence,
hooks and all, is taken apart in
pod lifecycle.
"I killed it, but it's still in ps"
You send TERM, then KILL, and the process is still listed. Before sending anything else, look at its state, because there are exactly two common ways this happens and neither of them can be fixed with another signal. Check the STAT column:
    $ ps -o pid,stat,wchan,comm -p 18230 24410
      PID STAT WCHAN            COMMAND
    18230 Z    -                worker <defunct>
    24410 D    rpc_wait_bit_kil backup-job
Z is a zombie, and a zombie is already dead. The process finished, or your
earlier signal worked, and everything it owned has been released: memory, descriptors,
sockets, all gone. What remains is one entry in the process table holding its exit status,
kept around because its parent has not yet called wait() to read it. You
cannot kill it again; there is nothing left to kill, and signals to a zombie are discarded.
The bug is in the parent, which is failing to reap its children. Find it with
ps -o ppid= -p 18230 and fix or restart it; if the parent dies, the zombie is
reparented to init, which reaps it immediately. One zombie is cosmetic. A
growing herd of them is a parent leaking process-table entries, and the process table is
finite.
D is uninterruptible sleep, and it is the opposite problem: the process is
alive but the kernel will not let your signal interrupt it. D state means the process is
blocked inside a kernel operation that cannot be safely abandoned partway, classically I/O
against a hung NFS server or a dying disk. Your KILL is not refused; it is pending,
sitting in the bitmask, and it will be acted on the instant the kernel operation completes.
If the NFS server never answers, that instant never comes, and no signal of any kind, not
even 9, changes that. The WCHAN column above is the clue: rpc_wait_bit_kil is
an NFS RPC wait. The fix is never another signal; it is fixing the I/O the process is stuck
on, or unsticking the mount it is waiting against. To see exactly which file or mount that
is, read the process's descriptors and syscall directly:
/proc shows what it holds and
strace shows where it is stuck.
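The triage in this section can be condensed into a first-look helper. This is a sketch that assumes a Linux /proc filesystem; `diagnose_pid` is a hypothetical name, and /proc/PID/wchan may be empty, zero, or unreadable depending on kernel version and permissions.

```shell
#!/usr/bin/env bash
# diagnose_pid PID  (hypothetical helper, Linux /proc only)
# Reports why a process might still be listed after you signalled it.
diagnose_pid() {
  local pid=$1 key value state="" ppid=""
  [ -r "/proc/$pid/status" ] || { echo "pid $pid: no such process"; return 1; }

  # Pull the one-letter state and the parent PID out of the status file.
  while read -r key value _; do
    case $key in
      State:) state=$value ;;
      PPid:)  ppid=$value ;;
    esac
  done < "/proc/$pid/status"

  case $state in
    Z) echo "pid $pid: zombie, already dead; parent $ppid has not reaped it" ;;
    D) echo "pid $pid: uninterruptible sleep in '$(cat "/proc/$pid/wchan" 2>/dev/null)'; fix the I/O, not the signal" ;;
    T) echo "pid $pid: stopped; it will act on pending signals after CONT" ;;
    *) echo "pid $pid: state $state; signals should be delivered normally" ;;
  esac
}
```

Run as `diagnose_pid 18230`, it answers the question that matters before any further `kill`: whether another signal could possibly help.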
Reload the config without dropping a connection
The config change is ready, and the naive path, restart the service, drops every active
connection for however long the restart takes. The grown-up path is the HUP convention,
and nginx is its best showcase. Send kill -HUP to the nginx master process
(or run nginx -s reload, which does the same thing) and the master re-reads
and validates the configuration, and only if it parses cleanly does anything change: new
worker processes start under the new config, the old workers stop accepting connections
but keep serving the ones they have until each finishes, then exit. From the outside,
nothing happened. No connection was dropped, no port was ever unbound, and a config typo
results in a logged error and the old config staying live, rather than an outage.
PostgreSQL follows the same convention with its own dialect: HUP to the postmaster (or
pg_reload_conf() from a session) re-reads postgresql.conf, and
parameters marked reloadable take effect without touching a single connection, while
parameters that need a restart log that fact and wait. nginx pushes the idea further with
USR2, which forks a new binary alongside the old one for zero-downtime upgrades of
the executable itself, the two masters briefly sharing the listening sockets. The pattern
to internalize: for any long-lived daemon, the man page's SIGNALS section is the daemon's
real admin API, and a reload signal is almost always available and almost always cheaper
than a restart. The only rule is to confirm the convention before trusting it, because a
daemon that never registered a HUP handler will take the default action and die, politely,
with exit code 129.
What happens underneath
Everything above gets easier to reason about once you can see the machinery, and the
machinery is small. Sending a signal is a system call: kill(2) takes a PID
and a signal number, and the kernel checks permission (you may signal a process if you are
root or if it runs as your user) and then does something almost anticlimactic: it sets one
bit in the target's pending signal set, a per-process bitmask with one position
per signal. That is the send, complete. Nothing about the target is interrupted yet, and
the sender gets no information back beyond "the bit was set."
The second half happens on the target's schedule. At the next transition from kernel mode
back to user mode, returning from a system call, coming back from an interrupt, being
picked to run again by the scheduler, the kernel checks the pending set. If a bit is set
and not masked, the kernel consults that signal's disposition, a per-signal
setting with three possible values. Default: the kernel applies its built-in action for
that signal, which is terminate for most, terminate-with-core-dump for the fault family
(QUIT, SEGV, ABRT), stop for the stop family, and ignore for a few like CHLD. Ignore: the
signal is discarded as if never sent. Handler: the kernel pauses the process's normal flow,
arranges a call into the function the process registered with sigaction(2),
and resumes the original flow when the handler returns. The process can also temporarily
block signals with a second mask; a blocked signal stays pending and is acted on
when unblocked, which is how programs protect critical sections from interruption. When the
target acts is tied to when it next runs at all, which is the scheduler's call; the
mechanics of that hand-off live in
scheduling.
Now the two privileged signals make sense. KILL and STOP are not stronger signals that
punch through handlers; the kernel simply refuses to let any process change their
disposition. A sigaction call for signal 9 or 19 returns an error, full stop.
This is a deliberate design position: the system guarantees that a sufficiently privileged
operator always has a working kill switch and a working pause button that no buggy or
hostile process can disarm. The price of that guarantee is everything this page has been
warning about, because a signal that cannot be caught is also a signal that cannot trigger
cleanup. The guarantee and the price are the same fact viewed from two sides.
Two more pieces of the machinery explain things you will eventually see. First, standard
signals do not queue: the pending set is a bitmask, so sending TERM ten times to a busy
process delivers it once. If you need queued signals with payloads, the real-time range
(SIGRTMIN and up) provides them, though at that point you are usually better served by an
actual IPC channel; the trade-offs are covered in
interprocess communication.
Second, dispositions are inherited across fork() and handlers are reset across
exec(), which is why a daemon's children start with sensible defaults rather
than their parent's handler pointing into memory that no longer exists. The descriptor
tables, process states, and parent-child mechanics this all hangs off are the subject of
processes.
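The exec() half of that rule can be observed from a shell, along with one detail worth adding: a handler is reset to the default across exec(), but a disposition of "ignore" survives it. The sketch below uses two throwaway subshells that each replace themselves with `sleep`; the variable names and the half-second settling sleeps are assumptions for the demo.

```shell
#!/usr/bin/env bash
# Case 1: install a TERM handler, then replace the process image with sleep.
( trap 'echo "handler ran"' TERM; exec sleep 30 ) &
with_handler=$!

# Case 2: set TERM to "ignore", then replace the process image with sleep.
( trap '' TERM; exec sleep 30 ) &
with_ignore=$!

sleep 0.5                                  # let both subshells reach exec
kill -TERM "$with_handler" "$with_ignore"

# The handler did not survive exec, so sleep takes TERM's default action.
handler_status=0
wait "$with_handler" 2>/dev/null || handler_status=$?

sleep 0.5
# The ignore did survive exec, so this sleep is still alive.
if kill -0 "$with_ignore" 2>/dev/null; then ignore_survived=yes; else ignore_survived=no; fi

echo "handler case exited with status $handler_status"
echo "ignore case still running: $ignore_survived"

kill -KILL "$with_ignore" 2>/dev/null      # clean up the survivor
wait "$with_ignore" 2>/dev/null || true
```

The first status should be 143, the plain default-action death by TERM, and the second process should still be running. This is the same mechanism nohup relies on: it sets HUP to ignore and then execs the command.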
Pitfalls
kill -9 as a habit. The most common signal mistake is not technical, it is reflexive: years of tutorials saying "just kill -9 it" have trained a generation to open with the one signal that forbids cleanup. Most of the time you get away with it, because most processes most of the time have nothing important in flight, and that intermittent reward is exactly how bad habits calcify. The time you do not get away with it, the process was a database mid-checkpoint, or a worker holding a distributed lock with no timeout, and the cleanup that never ran becomes its own incident. Send TERM, give it real seconds, then escalate. The pause costs you almost nothing; skipping it occasionally costs a lot.
pkill and killall match more than you meant. pkill java
signals every process whose name matches "java", which on a shared box may include
three services that were not in your blast radius, and pkill -f matches
against the entire command line, where an innocent pattern like -f backup
can match an editor session with "backup" in a filename. The discipline is mechanical:
run the same pattern through pgrep -a first, read the full list of what would
be signalled, and only then swap pgrep for pkill. Also remember that pkill's exit status
means "matched at least one process," not "matched the process you were thinking of."
PID 1 in a container ignores your signals. The kernel treats PID 1
specially: signals whose disposition is still the default are simply not delivered to it,
on the logic that init must not be killable by accident. Inside a container, your
application is PID 1, and the consequence is a famous gotcha: docker stop
sends SIGTERM, the default-action TERM is never delivered, nothing happens for the grace
period, and then SIGKILL (which is exempt from the rule) ends things the bad way. Every
such container takes ten silent seconds to stop and never runs cleanup. The shell-form
ENTRYPOINT makes it worse by putting a shell at PID 1 that does not forward
signals to your actual process, and a PID 1 that never calls wait() also
accumulates zombies. The fixes are standard: use the exec form so your app is PID 1 and
handles TERM explicitly, or put a minimal init like tini in front
(docker run --init, or Kubernetes with a shared process namespace), whose
whole job is forwarding signals and reaping children.
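If an entrypoint script has to stay at PID 1, it can at least forward TERM itself. The function below is a minimal sketch of that forwarding pattern, not a replacement for tini: `forward_signals` is a hypothetical name, it handles TERM only, and it makes no attempt to reap orphaned grandchildren.

```shell
#!/usr/bin/env bash
# forward_signals COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND as a child, passes TERM through to it, and returns its status.
forward_signals() {
  local child status=0 got_signal=0

  "$@" &                                   # the real workload runs as a child
  child=$!

  # Handler: note that a signal arrived and pass the request to the child.
  trap 'got_signal=1; kill -TERM "$child" 2>/dev/null' TERM

  # wait returns early with a status above 128 when the trap fires.
  wait "$child" || status=$?

  # If we were interrupted, wait again to collect the child's real exit status.
  if [ "$got_signal" -eq 1 ]; then
    status=0
    wait "$child" || status=$?
  fi

  trap - TERM
  return "$status"
}
```

At the end of an entrypoint script this would be called as `forward_signals my-server --port 8080`, where `my-server` stands in for your own binary. Because the trap is an explicit handler, TERM is no longer at its default disposition and is delivered even to PID 1.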
Signal handlers that do too much. A handler can run between any two
instructions of your program, including halfway through malloc() updating its
internal bookkeeping or printf() holding a lock on a stream. If the handler
then calls malloc or printf itself, it re-enters code that is
mid-mutation, and the result is a deadlock or corruption that reproduces once a month.
POSIX keeps a short list of async-signal-safe functions a handler may call; the practical
discipline is stricter and simpler: a handler should set a flag (a
sig_atomic_t, or write one byte to a pipe the main loop watches, the
self-pipe trick) and return, with the real shutdown logic running in the main loop where
normal rules apply. Runtimes like Go and Java do this translation for you, turning signals
into channel messages and hook threads. If you write handlers by hand, keep them almost
insultingly small.
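The set-a-flag shape looks like this in a shell script. Treat it as a sketch of the structure only: bash already defers traps until the current command finishes, so the re-entrancy hazard described above belongs to C-level handlers, not to this script. `serve_until_term` is a hypothetical name and the `sleep 0.2` is a stand-in for one unit of real work.

```shell
#!/usr/bin/env bash
# serve_until_term  (hypothetical helper)
# The handler only sets a flag; the main loop notices it and does the cleanup.
serve_until_term() {
  local shutdown_requested=0
  trap 'shutdown_requested=1' TERM         # handler: one assignment, nothing else

  while [ "$shutdown_requested" -eq 0 ]; do
    sleep 0.2                              # stand-in for one unit of real work
  done

  # Real shutdown logic runs here, in normal code, where normal rules apply.
  echo "TERM noticed by main loop, draining"
  echo "shutdown complete"
  trap - TERM
  return 0
}
```

Keeping the cleanup in the main loop also means the in-progress unit of work always finishes before shutdown starts, which is usually the behavior you want from a drain.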
A drill you can run right now
Everything below is safe on any Linux or macOS machine, including a shared one: it creates
one throwaway script in /tmp and signals only processes you start yourself.
Ten minutes, and the three big ideas, the TERM handler, the uncatchable pair, and the
128+N arithmetic, become things you have watched happen.
Step 1 — catch a TERM with your own hands. Write a five-line script that
registers handlers the way a real daemon does, using the shell built-in trap:
    $ cat > /tmp/trapdemo.sh <<'EOF'
    #!/usr/bin/env bash
    trap 'echo "caught SIGTERM: cleaning up"; exit 0' TERM
    trap 'echo "caught SIGHUP: pretending to reload config"' HUP
    echo "running as PID $$: try kill -HUP, then plain kill"
    while true; do sleep 1; done
    EOF
    $ bash /tmp/trapdemo.sh &
    running as PID 7311: try kill -HUP, then plain kill
    $ kill -HUP 7311
    caught SIGHUP: pretending to reload config
    # still running: HUP did not end it, the handler did its job and returned
    $ kill 7311
    caught SIGTERM: cleaning up
    # this is the entire graceful-shutdown pattern, in five lines of bash
Watch what each send did. HUP arrived, the handler printed its line and returned, and the
loop went on, exactly how nginx survives a reload. TERM arrived, the handler ran its
cleanup and chose to exit 0, exactly the contract Kubernetes is counting on during every
rollout. One detail worth noticing: the reaction can lag by up to a second, because bash
waits for the running sleep to finish before running the trap; even a polite
signal is acted on when the target gets around to it.
Step 2 — the pause button. Start a long sleep, freeze it, look at its state, and thaw it:
    $ sleep 600 &
    [1] 7402
    $ kill -STOP 7402 && ps -o pid,stat,comm -p 7402
      PID STAT COMM
     7402 T    sleep
    $ kill -CONT 7402 && ps -o pid,stat,comm -p 7402
      PID STAT COMM
     7402 S    sleep
    $ kill 7402
The STAT column flips to T, stopped, and back to S, sleeping.
Try adding a TERM while it is stopped: the signal goes pending and is acted on the moment
CONT arrives, which is the pending-set bitmask described earlier, made visible. This is also
your model for D state, with one difference that matters: a stopped process resumes when
you say so, a D-state process resumes when its I/O does, and no signal hurries either one.
Step 3 — read a cause of death off the exit code. Kill two sleeps two different ways and ask the shell what it saw:
    $ sleep 300 &
    [1] 7515
    $ kill -9 7515; wait 7515; echo $?
    [1]+  Killed                  sleep 300
    137
    $ sleep 300 &
    [1] 7522
    $ kill 7522; wait 7522; echo $?
    [1]+  Terminated              sleep 300
    143
There it is: 137 is 128 plus 9, and 143 is 128 plus 15. The next time a container status page, a CI job, or a supervisor log shows you 137, you will not be guessing; you will know a SIGKILL ended that process, and the only remaining question is whether a human sent it, the OOM killer sent it, or an ignored SIGTERM ran out a grace period. Three suspects, one number, and you have now produced the number yourself. (One incident this page deliberately leaves for its neighbor: when the process you killed is gone but its port is still taken, the answer involves sockets, not signals, and lives in what's holding this port?)
kill PID first, always: it is
TERM, and TERM can clean up. Escalate to kill -9 only after waiting and
checking. And read 137 as "128 + 9": somebody's SIGKILL, the kernel's or yours.
Further reading
- signal(7) — the overview man page — the full signal table, default actions, and the async-signal-safe function list; the single best reference on this topic.
- sigaction(2) — how handlers are actually registered, and where the kernel's refusal to bend on SIGKILL and SIGSTOP is written down.
- Kubernetes docs — pod termination — the authoritative account of the TERM-grace-KILL sequence and where preStop hooks fit in it.
- tini — a tiny but valid init for containers — the README is a sharp, short explanation of the PID 1 signal and zombie-reaping problem it exists to solve.