05 / 28
Linux / 05

strace

The application says it loaded the config. The log says the connection succeeded. The process says it is working. And yet nothing works. Logs describe what a program believes it did; system calls record what it actually asked the kernel to do, and the kernel does not flatter anyone. strace sits on that boundary and writes down every request and every answer. This page covers the flags worth knowing, decodes a real trace line by line, walks three production incidents, explains the ptrace machinery underneath and its very real cost, and ends with a drill that will not hurt anything.


The question it answers

A process cannot do anything interesting on its own. It can compute in registers and shuffle its own memory around, and that is the whole list. To open a file, send a packet, read input, start a child, learn the time, or even exit, it has to ask the kernel, and every one of those requests is a system call. That makes the syscall boundary the one place where a program's behaviour is fully visible and fully honest. The application can lie in its logs, swallow an exception, or log "connected to database" before the connection actually completes. It cannot lie about the syscalls it makes, because those are not statements about its behaviour — they are its behaviour.

strace records that behaviour. For every system call the traced process makes, it prints one line: the call's name, its decoded arguments, and what the kernel said back. When a service hangs, the trace shows you the exact call it is stuck inside. When a program ignores the config file you swear it should be reading, the trace shows every path it tried and the kernel's verdict on each one. When a container dies before it manages to log a single line, the trace catches its last words. The tool's whole value is that it removes a layer of interpretation: instead of asking the program what it thinks happened, you watch what happened.

USER SPACEyour processcode, libraries, logs, beliefsstracealso a user-space processthe syscall boundary — the only honest interfaceKERNELfilesystem / VFSnetwork stackmemory / scheddevicesopenat / connect / read……and the return valuevia ptrace, the kernel stops theprocess at every crossing for stracelogs live above the line and can be wrong; the line itself cannot
strace does not look inside your program. It watches the boundary, where every meaningful action has to pass in one direction and every result in the other.

It helps to be clear about scope. strace sees system calls and signals, nothing else. A process burning CPU in a tight loop of pure computation makes no syscalls and produces an empty trace; for that problem you want what's eating my CPU? instead. It also shows you each call as the kernel decodes it, not as your code wrote it: your program called fopen(), but the trace shows openat(), because that is what libc translated it into. That gap is a feature once you are used to it. You are seeing the request after every wrapper, framework, and runtime has finished rephrasing it, which is exactly the version of events the wrappers might be hiding. The full anatomy of how a call crosses this line lives in system calls, and there is an interactive walk through one crossing in the syscall journey simulator.

The five flags that matter

Like most tools of its vintage, strace has accumulated a manual page far longer than its useful surface. Five flags cover nearly all real diagnostic work. Learn these and you can postpone the rest of the man page indefinitely.

FlagWhat it doesWhen you reach for it
-cNo live output; counts calls, errors, and time per syscall, then prints a summary table at exitFirst contact with an unfamiliar problem — the shape before the detail
-p 31774Attaches to an already-running process instead of starting oneThe service is hung right now and you cannot restart it to find out why
-fFollows children across fork, vfork, and cloneAlmost always. Anything with workers, threads, or a shell wrapper does its real work in a child
-e trace=openatFilters to named syscalls, or classes: -e trace=file, -e trace=networkCutting a firehose down to the one question you are asking
-tt -TWall-clock timestamps with microseconds (-tt) plus each call's duration (-T)Anything involving the words "slow," "hang," or "sometimes"

Two helpers round these out. -s 256 raises the string truncation limit: by default strace prints only the first 32 bytes of any string argument, which is just enough to show you the beginning of a path or a write buffer and hide the part you needed. And -o trace.txt sends output to a file instead of stderr, which matters more than it sounds — traces of real programs run to tens of thousands of lines, and you want to grep them, not scroll them. A serviceable default invocation for investigating a live process looks like this:

$ sudo strace -f -tt -T -s 256 -o /tmp/trace.txt -p 31774
strace: Process 31774 attached with 12 threads
^C  # detach when you have enough; the process keeps running
Start with -c, not with the firehose. The reflex is to run bare strace and watch text scroll. Resist it. strace -c first gives you the syscall histogram: which calls dominate, which ones fail, where the time goes. A column showing 4 000 openat errors, or 99% of wall time inside poll, tells you which filtered trace to run next. The summary is the map; the full trace is the street view. Use them in that order.

Reading a trace

Every line of strace output has the same grammar, and once you can parse it the rest of this page is easy. The shape is name(arguments) = return-value, optionally followed by an errno symbol and the kernel's plain-English explanation in parentheses. A non-negative return means success — for openat it is the new file descriptor, for read the byte count. A return of -1 means failure, and the errno that follows is the diagnosis. That last part deserves emphasis, because it is the single most useful habit this tool teaches: the errno is the answer. = -1 ENOENT (No such file or directory) is not noise after the number, it is the kernel telling you, in so many words, why the thing your program tried did not work.

An openat storm: the config search

Here is one of the most common patterns you will ever see in a trace — a program hunting for its configuration file through a list of fallback paths. Failed lookups are not bugs here; they are the search working as designed. The line that matters is the one where the return value finally goes positive.

$ strace -e trace=openat ./myapp 2>&1 | grep config
openat(AT_FDCWD, "/etc/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/deploy/.config/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/opt/myapp/config.yaml", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "./config.yaml", O_RDONLY) = 3

Read it as a story. The first argument, AT_FDCWD, says the path is resolved relative to the current working directory (only relevant for the last, relative one). The second argument is the path being tried, the third is the open mode. Three attempts come back -1 ENOENT: the file does not exist at that path, says the kernel, and the kernel is in a position to know. The fourth returns 3 — a real file descriptor, the search is over, and now you know with certainty which file this program reads. Note the 2>&1 on the command: strace writes its output to stderr, so you have to merge streams before you can pipe it through grep.

A connect that goes nowhere

Network calls decode just as cleanly, and this is where -tt -T earns its place in the default invocation. A struct argument is printed field by field, so the address and port the process is dialling are right there in the trace.

$ strace -tt -T -e trace=network ./myapp 2>&1 | tail -3
14:02:09.114220 socket(AF_INET, SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 <0.000019>
14:02:09.114501 connect(3, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.0.9.40")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000043>
14:02:09.114688 poll([{fd=3, events=POLLOUT}], 1, 30000) = 0 <30.000482>

The first timestamp column is -tt; the duration in angle brackets at the end of each line is -T. The connect on a non-blocking socket returns EINPROGRESS, which is not an error — it means "dialling, check back." The process then parks in poll waiting for the socket to become writable, with a 30-second timeout. The poll returns 0, meaning nothing became ready, after 30.000482 seconds. Translation: the process spent thirty full seconds trying to reach 10.0.9.40:5432 and got silence — no acceptance, no refusal, the signature of a host that is down or a firewall that drops packets. The application log for this same event said "retrying database connection." The trace says which host, which port, and exactly how long the silence lasted.

The read/EAGAIN idle loop

One more pattern, because misreading it wastes people's afternoons. On non-blocking descriptors, EAGAIN means "no data right now, try later" — and event-driven programs generate them as a matter of routine.

epoll_wait(5, [{events=EPOLLIN, data=...}], 64, -1) = 1 <4.180112>
read(7, "GET /healthz HTTP/1.1\r\nHost:..."..., 65536) = 412
read(7, 0x55d31be8a000, 65536) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, ...

The loop wakes from epoll_wait when a socket has data, reads the 412 bytes that arrived, reads again to drain the socket, gets EAGAIN because there is nothing left, and goes back to sleep. Every line of that is healthy. An event loop's trace is supposed to be full of EAGAINs; they are how the program discovers it has finished draining a socket. The pathological version looks different: read returning EAGAIN thousands of times per second with no epoll_wait between them is a busy loop, a bug, and a CPU furnace. Same errno, opposite meanings, and the difference is visible only in the rhythm of the surrounding calls. This is the general skill the tool builds — not memorising syscalls, but recognising the rhythms: search storms, dial-and-wait, drain loops.

Three production scenarios

The service that "hangs" on startup

A deploy goes out, the new instance starts, and then nothing: no listening port, no log output, no crash. The supervisor says it is running. Twenty minutes of staring at logs that end mid-thought tells you nothing, because the process is not failing — it is waiting, and programs do not log what they are waiting for. Attach and look:

$ sudo strace -f -tt -p 48112
strace: Process 48112 attached
14:21:40.005281 connect(4, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("10.0.12.7")}, 16 ...the cursor sits here, mid-line...

That unfinished line is the whole diagnosis. The process is blocked inside a connect to 10.0.12.7:6379 — a Redis host, and evidently a dead or unreachable one, because a healthy connect returns in microseconds. The application never logged "connecting to cache" before making the call, so from the outside it looked like a mystery hang. From the syscall boundary it is a one-line answer with the address attached. The fix might be the dead host, a stale config pointing at an old IP, or a missing timeout in the client — but the twenty minutes of guessing is over in the first second of trace.

Which config file does it actually read?

You changed the setting. You restarted the service. The behaviour did not change. Somewhere there is a second copy of the config, or an environment-specific override, or a default baked into a package, and the program is reading that one instead of yours. Documentation will tell you where the config should live; only the trace tells you where the process actually looked. Run the startup under a filtered trace and keep only the opens that succeeded:

$ strace -f -e trace=openat -o /tmp/t.txt ./service --start
$ grep -v ENOENT /tmp/t.txt | grep -E '\.(yaml|conf|ini|json)'
openat(AT_FDCWD, "/etc/service/service.conf", O_RDONLY) = 3
openat(AT_FDCWD, "/etc/service/conf.d/99-override.conf", O_RDONLY) = 3

There is the culprit: a conf.d override, loaded after the main file, quietly winning every merge. This technique generalises to any "which file does it really use" question — which CA bundle, which DNS resolver config, which Python interpreter on a machine with four of them. Trace openat, filter out the ENOENTs, and read the survivors. Note that this answers a different question from lsof: lsof shows what a process holds open right now, and a config file read at startup was opened, parsed, and closed long before you went looking. Files that are touched rather than held are tracing's territory.

The container that exits instantly

A container starts and dies in under a second, exit code 127 or 1, with either no output at all or an unhelpful one-liner. The image worked in staging. Nothing useful reaches the logs because the failure happens before logging is initialised — typically in the dynamic linker or the entrypoint script, the part of startup no application developer thinks about. Run the entrypoint under strace -f (the -f is mandatory here, because entrypoint scripts spawn the real binary as a child) and read the last twenty lines:

$ strace -f -e trace=file ./entrypoint.sh 2>&1 | tail -5
[pid  9214] openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libssl.so.3", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid  9214] writev(2, ..., "error while loading shared libraries: libssl.so.3"..., ...) = 94
[pid  9214] exit_group(127) = ?
[pid  9214] +++ exited with 127 +++

The binary was built against a newer libssl than the base image carries; the linker searched, found nothing, wrote one line to stderr that the container runtime dropped, and died. The other classic ending is execve(...) = -1 EACCES (Permission denied) — an entrypoint that lost its execute bit in the build, or a filesystem mounted noexec. Either way, the pattern is the same: when a process dies before it can speak, its syscall trace is the only testimony available, and the last few lines almost always contain a failing call followed by an exit_group with a non-zero status.

Underneath: ptrace, and what it costs

strace is not kernel magic and it has no privileged window into the machine. It is an ordinary user-space process built on one old, strange syscall: ptrace, the same primitive debuggers like gdb are built on. When strace attaches to a process, it asks the kernel to make it that process's tracer. From then on, the kernel stops the traced process at every syscall entry and every syscall exit, and lets the tracer inspect it while it is frozen — registers, memory, everything. That is the whole mechanism. The trace lines you read are strace decoding the frozen process's registers (which hold the syscall number and arguments) and reading argument buffers out of its memory, once on the way into the kernel and once on the way out.

traced processkernelstrace (tracer)1. write(1, "hi", 2) — traps into kernel2. syscall-entry stop: process frozen3. kernel wakes strace (waitpid returns)4. read registers + memory, print: write(1, "hi", 25. PTRACE_SYSCALL: resume until next stop6. kernel runs the real write()7. syscall-exit stop: frozen again8. strace woken again9. read return value, print: ) = 2 — resume10. process continues, none the wisertwo full stops and several context switches — for every single syscall
The stop-relay-resume loop. The traced process is frozen twice per syscall while strace, an ordinary process, reads its state and decides to let it continue.

Look at what that loop costs. A normal syscall is one quick round trip across the boundary. Under strace, the same syscall involves two complete stops of the traced process, two wakeups of the tracer, several ptrace calls to read registers and memory, and the context switches to match. A call that takes a microsecond untraced can take fifty or a hundred traced. For a program that makes a handful of syscalls per second, nobody notices. For a syscall-heavy workload — a busy web server, a database, anything doing small reads and writes in a hot loop — strace can slow the workload by 10x to 100x, and on a saturated production service that slowdown is itself an outage. This is not a flaw to be patched; it is the price of the design, paid on every call.

The practical rule: strace is a scalpel for processes that are stuck, broken, or mysterious, where correctness of the answer matters and the process is doing little anyway. It is the wrong instrument for "where does the time go" on a healthy busy service. For that, use a sampling profiler — perf interrupts the process a hundred-odd times a second to record where it is, which costs a rounding error instead of a multiplier; the trade-offs are covered in profiling. And when you want syscall-level visibility on production without the ptrace tax, the modern path is eBPF: programs that run inside the kernel at the same boundary and aggregate in place, no stopping, no relay. That toolchain gets its own treatment in eBPF observability. Think of strace, perf, and eBPF as three observers of the same boundary at three price points, and pick by how hot the thing you are observing is.

Pitfalls

Attaching to a hot production process casually. The overhead section above, restated as a warning label: strace -p on the busiest process on the box can take a degraded service and finish it off. Before attaching during an incident, ask how syscall-heavy the process is and whether it is already at the edge. Prefer -e trace=... filters (fewer stops relayed to you, though the process still stops for every call), keep the attachment short, and know that detaching with Ctrl-C leaves the process running normally. If the box is truly on fire, reach for sampling tools first and save the scalpel for a quieter moment or a sacrificial instance.

Forgetting -f. The classic. You trace a service's launcher, see a brief flurry of execve and clone, and then your trace goes quiet while the application visibly does things. Everything interesting is happening in a child process you are not following. Shell wrappers, process supervisors, pre-fork servers, thread pools — modern software is mostly children. There is rarely a good reason to omit -f; type it by reflex and read the [pid NNNN] prefixes to keep the actors straight.

Trusting truncated strings. The default 32-byte limit on printed strings means the trace shows write(2, "error while loading shared libra"..., 94) and cuts off exactly before the name of the library you needed. The ... at the end of a string is the tell. When the content of buffers matters — error messages, paths, HTTP headers, SQL — rerun with -s 256 or larger. Nothing is more deflating than catching the failure perfectly and finding the punchline truncated.

Containers refuse to be traced. Run strace inside a container and you will likely get ptrace: Operation not permitted, even as root inside the container. Default seccomp and capability profiles block ptrace because a process that can trace another can also puppet it. You need the SYS_PTRACE capability — with Docker that is --cap-add=SYS_PTRACE at run time, and in Kubernetes the cleaner route is an ephemeral debug container that shares the target pod's process namespace, traced from there. Decide how you will get a trace out of your platform before the night you need one.

Expecting strace to see everything. Two blind spots surprise people. Pure computation makes no syscalls, so CPU-bound spins are invisible. And reads or writes to memory-mapped files go through page faults, not read/write, so a program using mmap for I/O can move gigabytes while its trace shows almost nothing. An empty trace does not mean an idle process; it means a process that is not asking the kernel for anything new.

A drill you can run right now

Everything below is safe on any Linux machine: each step traces either a throwaway command or a sleep you started yourself. Ten minutes, and the summary table, the trace grammar, and the attach/detach cycle stop being theory.

Step 1 — the summary table. Trace something tiny and look at the shape first, the way you would on a real problem:

$ strace -c ls > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 21.42    0.000241          15        16           mmap
 13.61    0.000153          17         9           openat
 11.92    0.000134          19         7           read
  9.79    0.000110          18         6           close
   ...
------ ----------- ----------- --------- --------- ----------------
100.00    0.001125                   118         4 total

Even ls on an empty terminal makes over a hundred syscalls, most of them the dynamic linker mapping libraries before main() ever runs. The errors column on a healthy command is usually non-zero too — harmless probe-and-fallback lookups. Sit with that for a second: a baseline of failed syscalls is normal, which is why "find the error" is never as simple as grepping for -1.

Step 2 — the grammar, live. Filter to one syscall and read every line out loud:

$ strace -e trace=openat cat /etc/hosts > /dev/null
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
+++ exited with 0 +++

Three opens belong to the runtime — the linker cache, libc itself, locale data — and only the last one is the file you asked for. This is the config-search scenario from earlier in miniature, and it teaches the other essential reading habit: most of any trace is the runtime clearing its throat, and you learn to skip to the part where the program starts speaking.

Step 3 — attach and detach. Practise the live-attach cycle on a process that cannot possibly mind:

$ sleep 300 &
[1] 52214
$ strace -p 52214
strace: Process 52214 attached
restart_syscall(<... resuming interrupted read ...>   ← blocked, mid-call, exactly like the hung service
^C
strace: Process 52214 detached
$ kill %1

Notice two things. First, the trace shows an unfinished line: sleep is blocked inside a syscall, waiting for its timer, and that frozen mid-call line is precisely what the hung-startup scenario looks like in the wild — you have now seen the signature on a harmless target. Second, Ctrl-C detached strace and the sleep kept running; attaching is not a death sentence for the target. If the attach fails with Operation not permitted, you have met either the privilege rule (you can only trace your own processes without root) or the Yama ptrace_scope hardening, both of which are better discovered now than during an incident. For extra credit, rerun step 2 with -tt -T and watch the timestamps appear, then go poke at /proc/52214/status while attached and find the TracerPid field naming your strace — a preview of the next page.

If you remember one line. strace -c CMD for the shape, strace -f -tt -T -s 256 -e trace=... -o file for the detail, and when a process hangs with no explanation: attach, read the one frozen line, detach. The errno is the answer.

Further reading

  • strace(1) — the manual page — the -e filtering expressions and the syscall class names (file, network, signal, ipc) are the sections worth a careful read.
  • ptrace(2) — the primitive underneath; the description of syscall-stops explains everything about the overhead.
  • Julia Evans — the strace zine — the friendliest possible introduction, and proof that the tool fits on a few hand-drawn pages.
  • Semicolony — System calls — what is actually happening at the boundary strace watches.
Found this useful?