strace
The application says it loaded the config. The log says the connection succeeded. The
process says it is working. And yet nothing works. Logs describe what a program
believes it did; system calls record what it actually asked the kernel to do, and
the kernel does not flatter anyone. strace sits on that boundary and writes
down every request and every answer. This page covers the flags worth knowing, decodes a
real trace line by line, walks three production incidents, explains the ptrace machinery
underneath and its very real cost, and ends with a drill that will not hurt anything.
The question it answers
A process cannot do anything interesting on its own. It can compute in registers and shuffle its own memory around, and that is the whole list. To open a file, send a packet, read input, start a child, learn the time, or even exit, it has to ask the kernel, and every one of those requests is a system call. That makes the syscall boundary the one place where a program's behaviour is fully visible and fully honest. The application can lie in its logs, swallow an exception, or log "connected to database" before the connection actually completes. It cannot lie about the syscalls it makes, because those are not statements about its behaviour — they are its behaviour.
strace records that behaviour. For every system call the traced process makes, it
prints one line: the call's name, its decoded arguments, and what the kernel said back. When a
service hangs, the trace shows you the exact call it is stuck inside. When a program ignores
the config file you swear it should be reading, the trace shows every path it tried and the
kernel's verdict on each one. When a container dies before it manages to log a single line,
the trace catches its last words. The tool's whole value is that it removes a layer of
interpretation: instead of asking the program what it thinks happened, you watch what
happened.
It helps to be clear about scope. strace sees system calls and signals, nothing
else. A process burning CPU in a tight loop of pure computation makes no syscalls and produces
an empty trace; for that problem you want
what's eating my CPU? instead. It
also shows you each call as the kernel decodes it, not as your code wrote it: your program
called fopen(), but the trace shows openat(), because that is what
libc translated it into. That gap is a feature once you are used to it. You are seeing the
request after every wrapper, framework, and runtime has finished rephrasing it, which is
exactly the version of events the wrappers might be hiding. The full anatomy of how a call
crosses this line lives in
system calls, and
there is an interactive walk through one crossing in the
syscall journey simulator.
The five flags that matter
Like most tools of its vintage, strace has accumulated a manual page far longer
than its useful surface. Five flags cover nearly all real diagnostic work. Learn these and you
can postpone the rest of the man page indefinitely.
| Flag | What it does | When you reach for it |
|---|---|---|
-c | No live output; counts calls, errors, and time per syscall, then prints a summary table at exit | First contact with an unfamiliar problem — the shape before the detail |
-p 31774 | Attaches to an already-running process instead of starting one | The service is hung right now and you cannot restart it to find out why |
-f | Follows children across fork, vfork, and clone | Almost always. Anything with workers, threads, or a shell wrapper does its real work in a child |
-e trace=openat | Filters to named syscalls, or classes: -e trace=file, -e trace=network | Cutting a firehose down to the one question you are asking |
-tt -T | Wall-clock timestamps with microseconds (-tt) plus each call's duration (-T) | Anything involving the words "slow," "hang," or "sometimes" |
Two helpers round these out. -s 256 raises the string truncation limit: by
default strace prints only the first 32 bytes of any string argument, which is
just enough to show you the beginning of a path or a write buffer and hide the part you
needed. And -o trace.txt sends output to a file instead of stderr, which matters
more than it sounds — traces of real programs run to tens of thousands of lines, and you want
to grep them, not scroll them. A serviceable default invocation for investigating a live
process looks like this:
$ sudo strace -f -tt -T -s 256 -o /tmp/trace.txt -p 31774 strace: Process 31774 attached with 12 threads ^C # detach when you have enough; the process keeps running
strace and watch text scroll. Resist it. strace -c first gives you
the syscall histogram: which calls dominate, which ones fail, where the time goes. A column
showing 4 000 openat errors, or 99% of wall time inside poll,
tells you which filtered trace to run next. The summary is the map; the full trace is the
street view. Use them in that order.Reading a trace
Every line of strace output has the same grammar, and once you can parse it the rest of this
page is easy. The shape is name(arguments) = return-value, optionally followed by
an errno symbol and the kernel's plain-English explanation in parentheses. A non-negative
return means success — for openat it is the new file descriptor, for
read the byte count. A return of -1 means failure, and the errno
that follows is the diagnosis. That last part deserves emphasis, because it is the single most
useful habit this tool teaches: the errno is the answer. = -1 ENOENT (No such file or directory) is not noise after the number, it is the
kernel telling you, in so many words, why the thing your program tried did not work.
An openat storm: the config search
Here is one of the most common patterns you will ever see in a trace — a program hunting for its configuration file through a list of fallback paths. Failed lookups are not bugs here; they are the search working as designed. The line that matters is the one where the return value finally goes positive.
$ strace -e trace=openat ./myapp 2>&1 | grep config openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/home/deploy/.config/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "/opt/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory) openat(AT_FDCWD, "./config.yaml", O_RDONLY) = 3
Read it as a story. The first argument, AT_FDCWD, says the path is resolved
relative to the current working directory (only relevant for the last, relative one). The
second argument is the path being tried, the third is the open mode. Three attempts come back
-1 ENOENT: the file does not exist at that path, says the kernel, and the kernel
is in a position to know. The fourth returns 3 — a real file descriptor, the
search is over, and now you know with certainty which file this program reads. Note the
2>&1 on the command: strace writes its output to stderr, so you have to
merge streams before you can pipe it through grep.
A connect that goes nowhere
Network calls decode just as cleanly, and this is where -tt -T earns its place in
the default invocation. A struct argument is printed field by field, so the address and port
the process is dialling are right there in the trace.
$ strace -tt -T -e trace=network ./myapp 2>&1 | tail -3 14:02:09.114220 socket(AF_INET, SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 <0.000019> 14:02:09.114501 connect(3, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.9.40")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000043> 14:02:09.114688 poll([{fd=3, events=POLLOUT}], 1, 30000) = 0 <30.000482>
The first timestamp column is -tt; the duration in angle brackets at the end of
each line is -T. The connect on a non-blocking socket returns
EINPROGRESS, which is not an error — it means "dialling, check back." The process
then parks in poll waiting for the socket to become writable, with a 30-second
timeout. The poll returns 0, meaning nothing became ready, after
30.000482 seconds. Translation: the process spent thirty full seconds trying to
reach 10.0.9.40:5432 and got silence — no acceptance, no refusal, the signature of
a host that is down or a firewall that drops packets. The application log for this same event
said "retrying database connection." The trace says which host, which port, and exactly how
long the silence lasted.
The read/EAGAIN idle loop
One more pattern, because misreading it wastes people's afternoons. On non-blocking
descriptors, EAGAIN means "no data right now, try later" — and event-driven
programs generate them as a matter of routine.
epoll_wait(5, [{events=EPOLLIN, data=...}], 64, -1) = 1 <4.180112>
read(7, "GET /healthz HTTP/1.1\r\nHost:..."..., 65536) = 412
read(7, 0x55d31be8a000, 65536) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, ... The loop wakes from epoll_wait when a socket has data, reads the 412 bytes that
arrived, reads again to drain the socket, gets EAGAIN because there is nothing
left, and goes back to sleep. Every line of that is healthy. An event loop's trace is
supposed to be full of EAGAINs; they are how the program discovers it has finished
draining a socket. The pathological version looks different: read returning
EAGAIN thousands of times per second with no epoll_wait between them is a busy
loop, a bug, and a CPU furnace. Same errno, opposite meanings, and the difference is visible
only in the rhythm of the surrounding calls. This is the general skill the tool builds — not
memorising syscalls, but recognising the rhythms: search storms, dial-and-wait, drain loops.
Three production scenarios
The service that "hangs" on startup
A deploy goes out, the new instance starts, and then nothing: no listening port, no log output, no crash. The supervisor says it is running. Twenty minutes of staring at logs that end mid-thought tells you nothing, because the process is not failing — it is waiting, and programs do not log what they are waiting for. Attach and look:
$ sudo strace -f -tt -p 48112 strace: Process 48112 attached 14:21:40.005281 connect(4, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.12.7")}, 16 ...the cursor sits here, mid-line...
That unfinished line is the whole diagnosis. The process is blocked inside a
connect to 10.0.12.7:6379 — a Redis host, and evidently a dead or
unreachable one, because a healthy connect returns in microseconds. The application never
logged "connecting to cache" before making the call, so from the outside it looked like a
mystery hang. From the syscall boundary it is a one-line answer with the address attached. The
fix might be the dead host, a stale config pointing at an old IP, or a missing timeout in the
client — but the twenty minutes of guessing is over in the first second of trace.
Which config file does it actually read?
You changed the setting. You restarted the service. The behaviour did not change. Somewhere there is a second copy of the config, or an environment-specific override, or a default baked into a package, and the program is reading that one instead of yours. Documentation will tell you where the config should live; only the trace tells you where the process actually looked. Run the startup under a filtered trace and keep only the opens that succeeded:
$ strace -f -e trace=openat -o /tmp/t.txt ./service --start $ grep -v ENOENT /tmp/t.txt | grep -E '\.(yaml|conf|ini|json)' openat(AT_FDCWD, "/etc/service/service.conf", O_RDONLY) = 3 openat(AT_FDCWD, "/etc/service/conf.d/99-override.conf", O_RDONLY) = 3
There is the culprit: a conf.d override, loaded after the main file, quietly
winning every merge. This technique generalises to any "which file does it really use"
question — which CA bundle, which DNS resolver config, which Python interpreter on a machine
with four of them. Trace openat, filter out the ENOENTs, and read the survivors.
Note that this answers a different question from
lsof: lsof shows what a process holds open
right now, and a config file read at startup was opened, parsed, and closed long
before you went looking. Files that are touched rather than held are
tracing's territory.
The container that exits instantly
A container starts and dies in under a second, exit code 127 or 1, with either no output at
all or an unhelpful one-liner. The image worked in staging. Nothing useful reaches the logs
because the failure happens before logging is initialised — typically in the dynamic linker or
the entrypoint script, the part of startup no application developer thinks about. Run the
entrypoint under strace -f (the -f is mandatory here, because
entrypoint scripts spawn the real binary as a child) and read the last twenty lines:
$ strace -f -e trace=file ./entrypoint.sh 2>&1 | tail -5 [pid 9214] openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libssl.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) [pid 9214] writev(2, ..., "error while loading shared libraries: libssl.so.3"..., ...) = 94 [pid 9214] exit_group(127) = ? [pid 9214] +++ exited with 127 +++
The binary was built against a newer libssl than the base image carries; the linker searched,
found nothing, wrote one line to stderr that the container runtime dropped, and died. The
other classic ending is execve(...) = -1 EACCES (Permission denied) — an
entrypoint that lost its execute bit in the build, or a filesystem mounted
noexec. Either way, the pattern is the same: when a process dies before it can
speak, its syscall trace is the only testimony available, and the last few lines almost always
contain a failing call followed by an exit_group with a non-zero status.
Underneath: ptrace, and what it costs
strace is not kernel magic and it has no privileged window into the machine. It
is an ordinary user-space process built on one old, strange syscall: ptrace, the
same primitive debuggers like gdb are built on. When strace attaches to a process, it asks the
kernel to make it that process's tracer. From then on, the kernel stops the traced
process at every syscall entry and every syscall exit, and lets the tracer inspect it while it
is frozen — registers, memory, everything. That is the whole mechanism. The trace lines you
read are strace decoding the frozen process's registers (which hold the syscall number and
arguments) and reading argument buffers out of its memory, once on the way into the kernel and
once on the way out.
Look at what that loop costs. A normal syscall is one quick round trip across the boundary. Under strace, the same syscall involves two complete stops of the traced process, two wakeups of the tracer, several ptrace calls to read registers and memory, and the context switches to match. A call that takes a microsecond untraced can take fifty or a hundred traced. For a program that makes a handful of syscalls per second, nobody notices. For a syscall-heavy workload — a busy web server, a database, anything doing small reads and writes in a hot loop — strace can slow the workload by 10x to 100x, and on a saturated production service that slowdown is itself an outage. This is not a flaw to be patched; it is the price of the design, paid on every call.
The practical rule: strace is a scalpel for processes that are stuck, broken, or mysterious,
where correctness of the answer matters and the process is doing little anyway. It is the
wrong instrument for "where does the time go" on a healthy busy service. For that, use a
sampling profiler — perf interrupts the process a hundred-odd times a second to
record where it is, which costs a rounding error instead of a multiplier; the trade-offs are
covered in profiling. And when
you want syscall-level visibility on production without the ptrace tax, the modern path is
eBPF: programs that run inside the kernel at the same boundary and aggregate in place, no
stopping, no relay. That toolchain gets its own treatment in
eBPF observability. Think of
strace, perf, and eBPF as three observers of the same boundary at three price points, and pick
by how hot the thing you are observing is.
Pitfalls
Attaching to a hot production process casually. The overhead section above,
restated as a warning label: strace -p on the busiest process on the box can take
a degraded service and finish it off. Before attaching during an incident, ask how
syscall-heavy the process is and whether it is already at the edge. Prefer
-e trace=... filters (fewer stops relayed to you, though the process still stops
for every call), keep the attachment short, and know that detaching with Ctrl-C leaves the
process running normally. If the box is truly on fire, reach for sampling tools first and save
the scalpel for a quieter moment or a sacrificial instance.
Forgetting -f. The classic. You trace a service's launcher, see a brief
flurry of execve and clone, and then your trace goes quiet while the
application visibly does things. Everything interesting is happening in a child process you
are not following. Shell wrappers, process supervisors, pre-fork servers, thread pools —
modern software is mostly children. There is rarely a good reason to omit -f;
type it by reflex and read the [pid NNNN] prefixes to keep the actors straight.
Trusting truncated strings. The default 32-byte limit on printed strings
means the trace shows write(2, "error while loading shared libra"..., 94) and
cuts off exactly before the name of the library you needed. The ... at the end of
a string is the tell. When the content of buffers matters — error messages, paths, HTTP
headers, SQL — rerun with -s 256 or larger. Nothing is more deflating than
catching the failure perfectly and finding the punchline truncated.
Containers refuse to be traced. Run strace inside a container and you will
likely get ptrace: Operation not permitted, even as root inside the container.
Default seccomp and capability profiles block ptrace because a process that can trace another
can also puppet it. You need the SYS_PTRACE capability — with Docker that is
--cap-add=SYS_PTRACE at run time, and in Kubernetes the cleaner route is an
ephemeral debug container that shares the target pod's process namespace, traced from there.
Decide how you will get a trace out of your platform before the night you need one.
Expecting strace to see everything. Two blind spots surprise people. Pure
computation makes no syscalls, so CPU-bound spins are invisible. And reads or writes to
memory-mapped files go through page faults, not read/write, so a
program using mmap for I/O can move gigabytes while its trace shows almost
nothing. An empty trace does not mean an idle process; it means a process that is not asking
the kernel for anything new.
A drill you can run right now
Everything below is safe on any Linux machine: each step traces either a throwaway command or
a sleep you started yourself. Ten minutes, and the summary table, the trace
grammar, and the attach/detach cycle stop being theory.
Step 1 — the summary table. Trace something tiny and look at the shape first, the way you would on a real problem:
$ strace -c ls > /dev/null % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 21.42 0.000241 15 16 mmap 13.61 0.000153 17 9 openat 11.92 0.000134 19 7 read 9.79 0.000110 18 6 close ... ------ ----------- ----------- --------- --------- ---------------- 100.00 0.001125 118 4 total
Even ls on an empty terminal makes over a hundred syscalls, most of them the
dynamic linker mapping libraries before main() ever runs. The errors column on a
healthy command is usually non-zero too — harmless probe-and-fallback lookups. Sit with that
for a second: a baseline of failed syscalls is normal, which is why "find the error" is never
as simple as grepping for -1.
Step 2 — the grammar, live. Filter to one syscall and read every line out loud:
$ strace -e trace=openat cat /etc/hosts > /dev/null openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3 +++ exited with 0 +++
Three opens belong to the runtime — the linker cache, libc itself, locale data — and only the last one is the file you asked for. This is the config-search scenario from earlier in miniature, and it teaches the other essential reading habit: most of any trace is the runtime clearing its throat, and you learn to skip to the part where the program starts speaking.
Step 3 — attach and detach. Practise the live-attach cycle on a process that cannot possibly mind:
$ sleep 300 & [1] 52214 $ strace -p 52214 strace: Process 52214 attached restart_syscall(<... resuming interrupted read ...> ← blocked, mid-call, exactly like the hung service ^C strace: Process 52214 detached $ kill %1
Notice two things. First, the trace shows an unfinished line: sleep is
blocked inside a syscall, waiting for its timer, and that frozen mid-call line is precisely
what the hung-startup scenario looks like in the wild — you have now seen the signature on a
harmless target. Second, Ctrl-C detached strace and the sleep kept running; attaching is not a
death sentence for the target. If the attach fails with Operation not permitted,
you have met either the privilege rule (you can only trace your own processes without root) or
the Yama ptrace_scope hardening, both of which are better discovered now than
during an incident. For extra credit, rerun step 2 with -tt -T and watch
the timestamps appear, then go poke at /proc/52214/status while attached and find
the TracerPid field naming your strace — a preview of the next page.
strace -c CMD for the shape,
strace -f -tt -T -s 256 -e trace=... -o file for the detail, and when a process
hangs with no explanation: attach, read the one frozen line, detach. The errno is the answer.Further reading
- strace(1) — the manual page
— the
-efiltering expressions and the syscall class names (file, network, signal, ipc) are the sections worth a careful read. - ptrace(2) — the primitive underneath; the description of syscall-stops explains everything about the overhead.
- Julia Evans — the strace zine — the friendliest possible introduction, and proof that the tool fits on a few hand-drawn pages.
- Semicolony — System calls — what is actually happening at the boundary strace watches.