Is it the network?

The app is slow, or failing, and somebody in the incident channel has already typed the sentence: "looks like a network issue." Sometimes they are right. Usually they are early. The network is the part of the system nobody in the room owns, which makes it the easiest thing to blame and the hardest accusation to retract. This page is the investigation that settles it — split the path into segments, measure each one separately, and acquit them one by one until a single segment is left holding the evidence. Each step is a command, the output you will actually see, and the decision the output forces.

Step 0 — frame the accusation

Start by taking the accusation seriously, because it deserves a fair trial rather than a reflexive dismissal. The network really is guilty sometimes: paths lose packets, resolvers die, firewalls eat SYNs, an MTU mismatch silently swallows every large response while the small ones sail through. But "the network" is not one thing. Between your process and the dependency it calls there are at least five separable segments: name resolution, the TCP handshake, the TLS handshake, the round trips across the path itself, and the time the far server spends thinking before it answers. A slow request is slow in one of those segments far more often than in all of them, and the entire investigation is the act of finding out which.

That framing also sets the standard of evidence. "Pings look fine" acquits nothing — ping tests one segment (the path) with one packet size at one moment, and most of the failures on this page would pass it. "The graph is spiky" convicts nothing. What you want at every step is a measurement that isolates a single segment, so that when you say "it is not DNS" or "it is the server" you can paste the number that proves it. The good news is that the first command does most of the isolating in one move.

The decision. Refuse to argue about "the network" as a whole. Pick the one dependency call that is slow or failing, and go measure its segments. Step 1.

Step 1 — curl -w: the four-way split

This is the single best move in the whole investigation, and it is one command. curl's -w flag prints internal timers after the request completes, and four of them carve the request into exactly the segments you need to separate. Put the format in a file once and keep it forever:

$ cat /tmp/timing.txt
    namelookup   %{time_namelookup}s
       connect   %{time_connect}s
    appconnect   %{time_appconnect}s
 starttransfer   %{time_starttransfer}s
         total   %{time_total}s
$ curl -sS -o /dev/null -w @/tmp/timing.txt https://payments.internal/v1/health
    namelookup   0.004s   ← DNS answered: resolution took 4 ms
       connect   0.006s   ← TCP handshake done: 6 − 4 = 2 ms to SYN/SYN-ACK/ACK
    appconnect   0.021s   ← TLS done: 21 − 6 = 15 ms of handshake
 starttransfer   1.913s   ← first response byte: 1913 − 21 = 1892 ms of server think time
         total   1.915s   ← body transfer after first byte: 2 ms

Read it as a relay race with cumulative split times. Each number is a timestamp measured from the start of the request, so the differences between consecutive lines are the segments: namelookup is DNS; connect minus namelookup is the TCP handshake, which is one round trip and therefore a clean proxy for path latency; appconnect minus connect is the TLS handshake, another one or two round trips plus crypto; and starttransfer minus appconnect is the time the server sat on a fully delivered request before producing the first byte of response, plus the single round trip it takes the request to reach the server and that first byte to come back. That last gap has a name in every CDN dashboard, time to first byte, and it is the number that settles most accusations on the spot.

In the capture above, the network's entire contribution is 21 milliseconds: fast DNS, a 2 ms handshake that proves the path is short and clean, a routine TLS exchange. Then the request sat inside the server for 1.9 seconds. That is not a network problem in any shape. The packets arrived promptly, the answer was slow to exist. Hand the numbers to the team that owns the dependency and the conversation changes instantly, because you are no longer saying "we think it might be on your side" — you are saying "your service held a complete request for 1.892 seconds before the first byte, here is the timing."
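The subtraction is easy to get wrong under pressure, so here is a minimal sketch that does it for you. The function name split_timers and the segment labels are made up for illustration; the only thing taken from curl is the order of the five cumulative timers. It assumes an HTTPS request (for plain HTTP curl reports appconnect as 0 and the tls line would be meaningless).

```shell
# Turn curl's cumulative timers into per-segment durations in milliseconds.
# Arguments, in order: namelookup connect appconnect starttransfer total (seconds).
# split_timers is a hypothetical helper name, not part of curl.
split_timers() {
    awk -v dns="$1" -v con="$2" -v app="$3" -v ttfb="$4" -v tot="$5" 'BEGIN {
        printf "dns       %.0f ms\n", dns * 1000
        printf "tcp       %.0f ms\n", (con  - dns)  * 1000
        printf "tls       %.0f ms\n", (app  - con)  * 1000
        printf "wait      %.0f ms\n", (ttfb - app)  * 1000
        printf "transfer  %.0f ms\n", (tot  - ttfb) * 1000
    }'
}

# The capture shown above, fed in by hand:
split_timers 0.004 0.006 0.021 1.913 1.915
```

With a one-line format string the same function can take curl's output directly, for example split_timers $(curl -sS -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' https://payments.internal/v1/health).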

One run is an anecdote, so when the symptom is intermittent, loop it and print one line per attempt:

$ for i in $(seq 1 8); do curl -sS -o /dev/null -w '%{time_namelookup}  %{time_connect}  %{time_starttransfer}\n' https://payments.internal/v1/health; done
0.003  0.005  0.041
0.004  0.006  0.043
3.004  3.006  3.044   ← the slow ones: all three numbers shifted by the same 3 s
0.003  0.005  0.040
0.004  0.007  0.044
3.005  3.008  3.051   ← …because the 3 s happened before connect even started: it's DNS
0.003  0.005  0.042
0.004  0.006  0.039

The arithmetic does the diagnosis. On the slow runs, connect − namelookup is still 2 to 3 ms and starttransfer − connect is still 38 to 43 ms, within a few milliseconds of the fast runs. The entire extra three seconds lives inside namelookup. The TCP path is innocent, TLS is innocent, the server is innocent; the suspect is resolution. And notice the shape of the number: 3.004, 3.005. Real congestion produces a smear of values; a timeout constant produces the same round number every time. Round numbers are confessions, and we will extract this one fully in the worked example below.
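If the loop output is long, the same arithmetic can be done mechanically. This is a rough sketch, not a statistics package: it assumes the three-column format from the loop above (namelookup, connect, starttransfer) and simply reports which segment varies most between its best and worst run. The name blame_segment and the label tls+wait are illustrative.

```shell
# Read "namelookup connect starttransfer" lines and report the per-segment
# spread (worst minus best, in ms). The segment with the widest spread is
# where the intermittent delay lives. blame_segment is a hypothetical name.
blame_segment() {
    awk 'NF >= 3 {
        seg["dns"]      = $1 + 0
        seg["tcp"]      = $2 - $1
        seg["tls+wait"] = $3 - $2
        for (s in seg) {
            if (!(s in lo) || seg[s] < lo[s]) lo[s] = seg[s]
            if (!(s in hi) || seg[s] > hi[s]) hi[s] = seg[s]
        }
    }
    END {
        worst = ""
        for (s in lo) {
            spread = (hi[s] - lo[s]) * 1000
            printf "%-9s spread %.0f ms\n", s, spread
            if (worst == "" || spread > max) { worst = s; max = spread }
        }
        print "suspect: " worst
    }'
}

# The eight runs from the loop above:
blame_segment <<'EOF'
0.003  0.005  0.041
0.004  0.006  0.043
3.004  3.006  3.044
0.003  0.005  0.040
0.004  0.007  0.044
3.005  3.008  3.051
0.003  0.005  0.042
0.004  0.006  0.039
EOF
```

The last line of output names the segment to chase. The spread lines print in no guaranteed order, and a wide spread is a pointer to where to look next, not proof on its own.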

The four timers carve a request into segments with different owners. The differences between consecutive timers, not the raw values, are the evidence.

The decision. starttransfer − appconnect dominates → the server is slow, not the network; hand off with the timing and stop. namelookup big or spiky → step 2, DNS. connect − namelookup big, or the connection fails outright → step 3, TCP. Everything clean on this box but the symptom is intermittent or only hits some hosts → step 4, the path. Handshakes clean but large transfers stall → step 5.

Step 2 — when namelookup is big: ask DNS directly

curl told you resolution is slow; dig tells you which part of resolution. The question to separate is cache versus upstream: is the configured resolver answering slowly from its own cache (resolver sick), or answering instantly when it has the record and slowly when it has to go ask (upstream or authoritative sick), or not answering at all (dead, with a failover timeout doing the damage)? Two runs back to back start the split:

$ dig payments.internal
;; ANSWER SECTION:
payments.internal.    17    IN    A    10.40.12.7   ← TTL 17 s left: this answer is cached, expires soon
;; Query time: 3001 msec   ← three full seconds for a CACHED answer — the resolver path is sick
;; SERVER: 10.40.0.53#53(10.40.0.53)
$ dig payments.internal
;; Query time: 0 msec      ← same query, instant: so it is not "DNS is slow", it is "DNS is sometimes slow"
;; SERVER: 10.40.0.53#53(10.40.0.53)   ← suspect one resolver in the list, or cache misses

The Query time line is the whole tool. A consistently slow cached answer means the resolver itself, or the path to it, is struggling. Instant cached answers with slow misses means the resolver is fine and its upstream is not — test that by querying the authoritative server directly with dig @ns1.example.net payments.internal and comparing. An answer that is sometimes instant and sometimes a round three seconds means the client is timing out on one resolver and failing over to another, which is a configuration problem wearing a latency costume: check /etc/resolv.conf for a dead first nameserver, because the libc stub resolver tries the list in order and pays the full timeout before moving on. Also glance at the TTL while you are here — a record with a 10-second TTL guarantees constant cache misses, so every bit of upstream slowness gets amplified into request latency on a schedule.
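Two runs rarely settle a bimodal pattern, so it can help to collect the Query time line over many runs. A small sketch, assuming dig's usual text output in which the Query time line precedes the SERVER line; the function name query_times and the 1000 ms threshold for flagging a run are arbitrary choices for illustration.

```shell
# Read one or more dig outputs on stdin and print "server time" per query,
# flagging anything at or above one second. query_times is a hypothetical name.
query_times() {
    awk '/^;; Query time:/ { t = $4 + 0 }
         /^;; SERVER:/     { printf "%s %d ms%s\n", $3, t, (t >= 1000 ? "  <- slow" : "") }'
}

# The two runs shown above:
query_times <<'EOF'
;; Query time: 3001 msec
;; SERVER: 10.40.0.53#53(10.40.0.53)
;; Query time: 0 msec
;; SERVER: 10.40.0.53#53(10.40.0.53)
EOF
```

In practice the input would come from something like for i in 1 2 3 4 5 6 7 8; do dig payments.internal; sleep 1; done | query_times. A column of small numbers interrupted by the same round figure is the failover signature described above.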

The deeper mechanics — what the flags mean, tracing delegation with +trace, why dig bypasses nsswitch and your app might not — live in dig, and the system it interrogates is the subject of DNS.

The decision. Cached answers slow → the resolver or the hop to it; fix or replace the resolver. Cached fast, misses slow → upstream/authoritative; compare with a direct dig @authoritative. Bimodal with a round-number slow mode → a dead resolver in resolv.conf and a timeout constant; fix the list. Any of these is an ending — DNS was the culprit, the rest of the network is acquitted.

Step 3 — when connect is big or failing: the TCP segment

Now suppose the split blamed the middle: connect − namelookup is fat, or curl never connects at all. The first distinction to extract is refused versus timeout, because they are opposite verdicts, and nc gets it in one line:

$ nc -vz 10.40.12.7 443
nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection refused
   ↑ instant failure = a RST came back. The network DELIVERED that rejection — path works,
     host is up, nothing is listening on 443. Wrong port, crashed service, or wrong box.

$ nc -vz 10.40.12.7 443
(hangs…)
nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection timed out
   ↑ the SYN vanished. Nothing answered, nothing rejected: firewall rule, security group,
     routing blackhole, or a host that is simply gone. THIS one is a network/filter problem.

"Refused" is, counterintuitively, good news for the network: a refusal is a packet, and packets that arrive prove the path. The investigation moves to the far host — is the process running, is it bound to the right interface and port. "Timed out" means something between you and the listener swallowed the SYN without comment, and that is where firewalls, cloud security groups, and routing mistakes live. Before escalating, check the boring local candidates: ip route get 10.40.12.7 shows which route and interface the kernel would actually use (wrong VPN routes and stale routing tables are found here more often than anyone admits), and ip -s link shows interface-level errors and drops. Both are covered in ip.
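When the check runs from a script across many hosts, the verdict can be read off nc's message rather than by eye. The sketch below is an assumption-heavy convenience: it matches the wording printed by the OpenBSD-style nc shown above, and other netcat variants phrase their messages differently. connect_verdict is a made-up name.

```shell
# Map an nc -vz message to the verdict it implies.
# Assumes OpenBSD-style netcat wording; connect_verdict is a hypothetical helper.
connect_verdict() {
    case "$1" in
        *"Connection refused"*) echo "refused: path works, nothing listening on that port" ;;
        *"timed out"*)          echo "timeout: SYN dropped, suspect filter or routing" ;;
        *succeeded*)            echo "open: TCP handshake completed" ;;
        *)                      echo "unknown: read the message by hand" ;;
    esac
}

# The two failures shown above:
connect_verdict "nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection refused"
connect_verdict "nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection timed out"
```

A live call would look like connect_verdict "$(nc -vz -w 5 10.40.12.7 443 2>&1)", where -w 5 caps the wait at five seconds so the timeout case does not hang the script.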

If connections establish but feel slow, stop guessing and ask the kernel, because it has been measuring every connection all along. ss -ti prints the TCP stack's own statistics for a live connection:

$ ss -ti dst 10.40.12.7
ESTAB  0  0  10.40.3.21:51844  10.40.12.7:443
	 cubic wscale:7,7 rto:204 rtt:2.4/0.6 mss:1448 pmtu:1500 cwnd:36
	 bytes_acked:48211324 segs_out:34122 segs_in:30876
	 retrans:0/4 dsacks_dup:1 rcv_space:67392 minrtt:1.9
   rtt:2.4/0.6 → smoothed round-trip 2.4 ms, variance 0.6 — measured on real traffic, not pings
   retrans:0/4 → 0 segments in flight as retransmissions now, 4 total in the connection's life.
                 4 out of 34,122 sent is a clean path. A guilty path looks like retrans:0/3811.
   minrtt:1.9  → the best the path has ever done; rtt far above minrtt = queueing, not distance

This is the cheapest high-quality evidence in the whole investigation. The kernel timestamps every acknowledged segment, so rtt here is the real round-trip your traffic experiences, and retrans is an exact count of how many times TCP had to resend. A connection showing rtt:2.4 and retrans:0/4 acquits the path outright, whatever the dashboards say. One showing rtt:48/22 against a minrtt:1.9, or thousands of total retransmissions, convicts it — and gives you the number to escalate with. The full field-by-field tour is in ss, and what the retransmission and congestion machinery is actually doing under those counters is in TCP.
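To turn those counters into the two ratios that matter, a short awk pass is enough. This is a sketch under stated assumptions: it expects the output of ss -ti for a single connection (with several connections it keeps only the last values seen), and it relies on the field names rtt, minrtt, segs_out and retrans as they appear in the sample above. ss_health is an illustrative name.

```shell
# Summarise one connection from `ss -ti` output: retransmit ratio and
# how far the smoothed rtt sits above the best rtt ever seen.
# ss_health is a hypothetical helper; it assumes a single connection on stdin.
ss_health() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^rtt:/)      { split(substr($i, 5), r, "/"); rtt = r[1] + 0 }
            if ($i ~ /^minrtt:/)   { minrtt = substr($i, 8) + 0 }
            if ($i ~ /^segs_out:/) { sent = substr($i, 10) + 0 }
            if ($i ~ /^retrans:/)  { split(substr($i, 9), x, "/"); retr = x[2] + 0 }
        }
    }
    END {
        if (sent > 0)   printf "retransmitted %d of %d segments (%.2f%%)\n", retr, sent, 100 * retr / sent
        if (minrtt > 0) printf "rtt %.1f ms vs minrtt %.1f ms (x%.1f)\n", rtt, minrtt, rtt / minrtt
    }'
}

# The connection shown above:
ss_health <<'EOF'
ESTAB  0  0  10.40.3.21:51844  10.40.12.7:443
     cubic wscale:7,7 rto:204 rtt:2.4/0.6 mss:1448 pmtu:1500 cwnd:36
     bytes_acked:48211324 segs_out:34122 segs_in:30876
     retrans:0/4 dsacks_dup:1 rcv_space:67392 minrtt:1.9
EOF
```

Live, that would be ss -ti dst 10.40.12.7 | ss_health. What counts as a worrying ratio depends on the path, so treat the printed numbers as material for the escalation rather than as a pass/fail threshold.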

The decision. Refused → the far host's problem; network acquitted, go look at the listener. Timeout → filter or routing; check ip route get locally, then escalate to whoever owns the firewall. Established but slow with high rtt/retrans in ss -ti → real path trouble; go to step 4 to find where. Clean counters → the path is fine from this box; reconsider the accusation.

Step 4 — when it's intermittent: walk the path with mtr

Intermittent symptoms are where investigations go to die, because every one-shot tool keeps catching the good moments. mtr is traceroute that refuses to stop: it probes every hop on the path continuously and keeps per-hop loss and latency statistics, so a problem that shows up 4% of the time shows up as a 4% loss figure instead of as an argument. Run it in report mode with enough cycles to make the percentages mean something:

$ mtr --report --report-cycles 100 payments-gw.example.net
HOST: checkout-04                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.40.3.1                   0.0%   100    0.4   0.4   0.3   1.2   0.1
  2.|-- 10.40.0.9                   0.0%   100    0.7   0.8   0.6   2.4   0.2
  3.|-- core-rtr-2.example.net     40.0%   100    1.1   1.2   0.9   3.0   0.3
  4.|-- edge-fw-1.example.net       0.0%   100    1.6   1.8   1.4  11.2   1.1
  5.|-- payments-gw.example.net     0.0%   100    2.1   2.3   1.8   9.7   0.9
   40% "loss" at hop 3, but 0% at every hop AFTER it — hop 3 is rate-limiting its own
   ICMP replies while forwarding your traffic perfectly. Not loss. Read the LAST hop.

That hop-3 number is the classic misread, and it has ended careers of credibility in incident channels. Routers answer traceroute probes with their control plane, which is deliberately rate-limited and deprioritised — forwarding your packets is the job, answering your probes is a courtesy. So an intermediate hop showing loss that the hops after it do not inherit is a router declining the courtesy, nothing more. Real loss propagates: if hop 4 starts dropping your packets, then hops 5, 6, and the destination all show that loss too, because the probes to them pass through hop 4. The rule is mechanical — loss begins at hop N and continues to the destination: real, and it started near hop N. Loss appears at one hop and vanishes afterward: cosmetic, ignore it.

$ mtr --report --report-cycles 100 payments-gw.example.net   (a genuinely bad path)
HOST: checkout-04                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.40.3.1                   0.0%   100    0.4   0.4   0.3   1.1   0.1
  2.|-- 10.40.0.9                   0.0%   100    0.7   0.8   0.6   2.2   0.2
  3.|-- core-rtr-2.example.net      0.0%   100    1.1   1.2   0.9   2.8   0.3
  4.|-- transit-a.upstream.net      6.0%   100   24.3  31.8   1.6  212.4  44.0
  5.|-- payments-gw.example.net     7.0%   100   25.1  33.2   2.1  208.9  45.2
   loss starts at hop 4 and the DESTINATION inherits it; Avg jumps 1.2 → 31.8 ms and StDev
   explodes at the same hop. The path degrades at the handoff to transit-a. That is the report.

Notice everything arriving at once in the second report: loss begins at hop 4, persists to the end, average latency jumps more than twenty-fold at the same hop (1.2 ms to 31.8 ms), and the standard deviation explodes. That is a queueing, congested, or failing link, located. That report is also exactly the artifact your provider's support queue needs; "users say it's slow" gets triaged, an mtr report with loss starting at their hop gets fixed. Usage details, the UDP/TCP probe modes for paths that filter ICMP, and the nc pairing live in nc & mtr.
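The propagation rule is mechanical enough to script. The sketch below reads an mtr --report table, looks at the last hop first, and only if the destination shows loss walks backward to find where the unbroken run of lossy hops begins. It assumes the default report layout shown above (hop number, host, Loss% as the third column); mtr_verdict is a made-up name.

```shell
# Apply the "read the last hop" rule to an `mtr --report` table on stdin.
# mtr_verdict is a hypothetical helper, not part of mtr.
mtr_verdict() {
    awk '$1 ~ /^[0-9]+[.][|]--$/ {
        n++
        hop[n]  = $1 + 0      # "4.|--"  -> 4
        host[n] = $2
        loss[n] = $3 + 0      # "6.0%"   -> 6
    }
    END {
        if (n == 0)       { print "no hops parsed"; exit }
        if (loss[n] == 0) { print "clean at destination: intermediate loss is cosmetic"; exit }
        start = n
        while (start > 1 && loss[start - 1] > 0) start--
        printf "real loss: %.1f%% at destination, begins at hop %d (%s)\n", loss[n], hop[start], host[start]
    }'
}

# The genuinely bad path from the second report:
mtr_verdict <<'EOF'
HOST: checkout-04                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.40.3.1                   0.0%   100    0.4   0.4   0.3   1.1   0.1
  2.|-- 10.40.0.9                   0.0%   100    0.7   0.8   0.6   2.2   0.2
  3.|-- core-rtr-2.example.net      0.0%   100    1.1   1.2   0.9   2.8   0.3
  4.|-- transit-a.upstream.net      6.0%   100   24.3  31.8   1.6  212.4  44.0
  5.|-- payments-gw.example.net     7.0%   100   25.1  33.2   2.1  208.9  45.2
EOF
```

One caveat on the heuristic: a rate-limiting router that happens to sit directly before a genuinely lossy stretch gets lumped into the run, so the reported starting hop is "at or shortly after", not an exact address.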

The decision. Loss only at intermediate hops, last hop clean → cosmetic ICMP deprioritisation; the path is fine, go back and re-read your curl numbers. Loss that begins at hop N and propagates to the destination → real path degradation; capture the report and escalate to whoever owns hop N. Path clean both directions → the problem is at an endpoint; step 5 will say which one.

Step 5 — the courtroom: tcpdump

When the cheap tools disagree, or when you need evidence that survives cross-examination, capture the packets. You rarely need more than thirty seconds of a filtered capture, and you are looking for three specific signatures, each of which convicts a different party:

$ sudo tcpdump -ni eth0 host 10.40.12.7 and port 443
# signature 1 — retransmission: the same seq range sent twice, ~200 ms apart
10:41:22.183442 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
10:41:22.391077 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
   ↑ we sent it, no ACK came, the RTO fired and we sent it again. Something between the
     boxes lost either the data or its ACK. This is what real network loss looks like.

# signature 2 — duplicate ACKs: the receiver acking the same byte over and over
10:41:23.102211 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
10:41:23.102389 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
10:41:23.102541 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
   ↑ "I'm still waiting for byte 91308" three times = a segment went missing in flight and
     later ones arrived. Loss again — and it pinpoints the direction: toward the receiver.

# signature 3 — zero window: the receiver's buffer is full
10:41:24.518230 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 184204, win 0, length 0
   ↑ the packet ARRIVED and was acked — the network did its job. win 0 means the receiving
     APPLICATION is not reading from its socket. The network is acquitted; the app is guilty.

The third signature is the one to memorise, because it is the great exoneration. A zero window advertisement means TCP on the far side received your data, acknowledged it, and has nowhere to put more because the application above it has stopped draining the socket buffer. Every packet did everything right. The "network problem" is a consumer that cannot keep up — a thread pool exhausted, a downstream call blocking the read loop, a process deep in GC. Transfers that start fast and then stall in lockstep with the far service's CPU or latency graphs are this, and the fix is in that service, not in any switch. The capture-filter syntax, reading flags fluently, and writing captures to files for Wireshark are covered in tcpdump; the sequence and window machinery these signatures come from is in TCP.
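For a capture longer than a screenful, a rough counter for the three signatures saves scrolling. This is a heuristic sketch, not a replacement for Wireshark's analysis: it assumes IPv4 lines in tcpdump's default text format, counts a data segment as a retransmission when the same flow repeats the same seq range, counts a pure ACK as a duplicate when it repeats the previous ACK number on that flow, and reports which sender advertised win 0. tcp_signatures is an illustrative name.

```shell
# Count retransmissions, duplicate ACKs and zero-window advertisements in
# tcpdump text output on stdin. tcp_signatures is a hypothetical helper.
tcp_signatures() {
    awk '$2 == "IP" {
        flow = $3 " > " $5
        seq = ""; ack = ""; win = ""; len = ""
        for (i = 6; i <= NF; i++) {
            v = $(i + 1); sub(/,$/, "", v)        # value follows its keyword
            if ($i == "seq")    seq = v
            if ($i == "ack")    ack = v
            if ($i == "win")    win = v
            if ($i == "length") len = v
        }
        if (seq != "" && len + 0 > 0) {
            if (seen[flow, seq]++) retrans++      # same seq range sent again
        } else if (ack != "" && len + 0 == 0) {
            if (lastack[flow] == ack) dupack++    # same ACK number repeated
            lastack[flow] = ack
        }
        if (win == "0") zerowin[$3]++
    }
    END {
        printf "retransmitted segments: %d\n", retrans
        printf "duplicate ACKs:         %d\n", dupack
        for (h in zerowin) printf "zero-window from %s: %d\n", h, zerowin[h]
    }'
}

# The seven packets from the capture above:
tcp_signatures <<'EOF'
10:41:22.183442 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
10:41:22.391077 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
10:41:23.102211 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
10:41:23.102389 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
10:41:23.102541 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
10:41:24.518230 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 184204, win 0, length 0
EOF
```

Three identical ACKs count as two duplicates, since the first one is the original. Keepalives and window updates can also repeat an ACK number, so a handful of duplicates on an idle connection means nothing; it is the rate under load that convicts.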

The decision. Retransmissions or duplicate ACKs at a meaningful rate → real loss; pair with the mtr report to locate it and escalate. win 0 from the far side → the receiving application is the bottleneck; network acquitted, hand off to that team. win 0 from your side → your own app is the slow consumer; the accusation reverses. A clean capture under load → write down that the wire is innocent and go back to the timing split.

The decision tree, on one screen

Every branch either names a guilty segment or acquits one. The dashed box is the most common early exit: starttransfer dominates and the investigation is over in one command.

The tree has a shape worth noticing: it is mostly exits. The honest outcome of a network investigation is usually an acquittal with a forwarding address — "not the network, here is whose it is, here are the numbers." The steps exist to make that handoff stick. If a step neither convicts a segment nor acquits one, you measured the wrong thing; go back to the timing split and pick the segment the numbers actually point at.

The endings

Nearly every "it's the network" incident resolves to one of five stories, and knowing them ahead of time changes how you read every output above.

Ending 1 — the server was slow all along

The most common ending by a wide margin. starttransfer dominates the split, the handshakes are crisp, ss -ti shows a clean connection. The dependency is slow to produce answers, and everything between you and it is innocent. The work here is the handoff: paste the curl timing, name the gap ("1.9 s between request delivered and first byte"), and resist the urge to soften it into "might be on your side." Evidence makes handoffs frictionless; hedging makes them ping-pong. If the dependency's owners then run this same investigation from their side, against their downstreams, the incident converges instead of circulating.

Ending 2 — DNS

A dead or sick resolver, a timeout constant doing failover damage, a TTL short enough to turn every upstream wobble into request latency, or a resolver that is fine while the authoritative zone is not. The signature is latency quantised into round numbers and confined entirely to namelookup. Fixes are unglamorous and fast: repair the resolver list, shorten the timeout, run a local caching daemon, lengthen the TTL. The follow-up that prevents the rerun is monitoring resolution time as its own metric instead of letting it hide inside request latency.

Ending 3 — real path loss

The genuinely guilty network: retransmissions in ss -ti and the capture, loss in mtr that begins at a hop and propagates to the destination, latency variance exploding at the same hop. You usually cannot fix this yourself — the work is escalation done well. Attach the mtr report from both directions if you can get one back, the retrans counters, a capture excerpt, and the time window. A provider ticket with loss pinned to their hop and a packet trace gets engineering attention; one that says "customers report slowness" gets a script reply. Until it is fixed, route around it if you have the option, and say so in the incident notes.

Ending 4 — app back-pressure wearing a network costume

Zero-window advertisements: the far application (or yours) stopped draining its socket, TCP closed the window, throughput collapsed, and every dashboard called it a network problem because the symptom appeared between two boxes. The network delivered and acknowledged every byte. This ending matters because it loops back into the other investigations in this series — a consumer that cannot keep up is usually CPU-bound, lock-bound, or GC-bound, and the packet capture has just told you exactly which process to point those tools at.

Ending 5 — the MTU blackhole

The strangest one, and instantly recognisable once you know the telltale: small requests succeed, large ones hang forever. Health checks pass, logins work, and the 4 KB JSON upload dies every time. Somewhere on the path — typically a VPN, a tunnel, or an overlay network — the MTU is smaller than the endpoints believe, full-size packets are dropped, and the ICMP "fragmentation needed" messages that would trigger path-MTU discovery are being filtered, so the sender never learns. The handshake packets are small and always survive, which is why connections establish and then freeze. Confirm with ping -M do -s 1472 host: a don't-fragment ping at full size fails while small pings succeed, and shrinking the size until it passes measures the real path MTU. Fix by clamping MSS at the tunnel or lowering the interface MTU with ip. Any failure that is conditional on payload size should make you think MTU before anything else.
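The "shrink until it passes" step is a binary search, sketched below for IPv4. The 1472 comes from 1500 minus 20 bytes of IPv4 header and 8 bytes of ICMP header. The function names are hypothetical, the probe uses the Linux iputils ping flags (-M do to forbid fragmentation, -W 1 for a one-second wait), and the search bounds of 548 and 1472 payload bytes are an assumption that suits ordinary Ethernet paths.

```shell
# ICMP echo payload that exactly fills one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IP header minus 8 bytes of ICMP header.
payload_for_mtu() { echo $(( $1 - 28 )); }

# probe SIZE HOST: succeed if a don't-fragment ping with SIZE payload bytes gets a reply.
probe() { ping -c 1 -W 1 -M do -s "$1" "$2" >/dev/null 2>&1; }

# path_mtu HOST: binary-search the largest payload that passes, print the path MTU.
path_mtu() {
    lo=548; hi=1472
    probe "$lo" "$1" || { echo "even a 576-byte packet fails: look elsewhere first" >&2; return 1; }
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid" "$1"; then lo=$mid; else hi=$(( mid - 1 )); fi
    done
    echo $(( lo + 28 ))
}

payload_for_mtu 1500    # the full-size probe used in the text
```

Running path_mtu against the far host prints a single number; anything below the interface MTU that the endpoints are configured with is the size of the hole. A single lost probe can mislead the search on a lossy path, so repeat it before trusting an odd result.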

A worked example, end to end

Here is the path run once, the way it actually happens. Checkout calls the payments API on every order. The pager fires at 10:38: checkout p99 has spiked from 180 ms to over 3 seconds, intermittently, for twenty minutes. The payments team's own dashboard shows their p99 flat at 40 ms, so the incident channel has already reached its verdict: "must be network between us." SSH into a checkout box and put numbers on it.

$ for i in $(seq 1 8); do curl -sS -o /dev/null -w '%{time_namelookup}  %{time_connect}  %{time_starttransfer}\n' https://payments.internal/v1/health; done
0.003  0.005  0.041
0.004  0.006  0.042
3.004  3.006  3.045
0.003  0.005  0.040
0.003  0.006  0.043
0.004  0.005  0.041
3.006  3.008  3.049
0.003  0.005  0.042
  → connect − namelookup ≈ 2 ms always; starttransfer − connect ≈ 37 ms always. The path and
    the server are identical on good and bad runs. The 3 s lives entirely in namelookup —
    and it is the same round 3 s every time. A timeout constant. DNS. Step 2.

One loop and the field has narrowed from "the network" to "name resolution on this box." The payments team is already acquitted — their flat 40 ms p99 was telling the truth, since their clock starts when the request arrives, and these requests were losing three seconds before being sent. Now make DNS confess:

$ cat /etc/resolv.conf
nameserver 10.40.0.53
nameserver 10.40.0.54
options timeout:3 attempts:2   ← there is the 3-second constant, in writing
$ dig @10.40.0.53 payments.internal
;; communications error to 10.40.0.53#53: timed out   ← first resolver: dead
$ dig @10.40.0.54 payments.internal
payments.internal.    30    IN    A    10.40.12.7
;; Query time: 2 msec                                ← second resolver: perfectly healthy

The whole story is now in four lines. The first nameserver in the list died around 10:15 (its host was part of a maintenance batch, it later turns out). The record's TTL is 30 seconds, so each checkout process gets 30 seconds of cached, instant resolution, then a cache miss — and every miss tries the dead resolver first, waits the configured 3 seconds, fails over to the healthy one, and succeeds. Hence intermittent: most requests ride the cache; the unlucky ones pay exactly 3.00 s. Hence invisible to payments: the delay happens before the request exists on the wire. The mitigation is one line — pull 10.40.0.53 from resolv.conf across the fleet, p99 drops back to 180 ms within a minute. The fixes that outlast the incident: restore the resolver, set options timeout:1 rotate so a single dead entry costs one second instead of three, and add a resolution-time metric so the next dead resolver pages someone as a DNS problem instead of arriving dressed as a checkout latency mystery. Total investigation, about six minutes, and not one step was a guess: the loop isolated the segment, the round number named the mechanism, resolv.conf supplied the constant, and two digs found the body.
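The check that found the constant can be run fleet-wide before the next resolver dies. A minimal sketch that lists the nameservers in order and states what a dead first entry would cost. It assumes the glibc defaults documented in resolv.conf(5), a 5-second timeout and 2 attempts, when the options line does not set them; resolv_summary is an illustrative name.

```shell
# Summarise a resolv.conf: server order, timeout, and the price of a dead first entry.
# resolv_summary is a hypothetical helper; defaults are the glibc ones from resolv.conf(5).
resolv_summary() {
    awk '$1 == "nameserver" { ns[++n] = $2 }
         $1 == "options" {
             for (i = 2; i <= NF; i++) {
                 if ($i ~ /^timeout:/)  timeout  = substr($i, 9) + 0
                 if ($i ~ /^attempts:/) attempts = substr($i, 10) + 0
                 if ($i == "rotate")    rotate = 1
             }
         }
         END {
             if (!timeout)  timeout  = 5
             if (!attempts) attempts = 2
             for (i = 1; i <= n; i++) printf "nameserver %d: %s\n", i, ns[i]
             printf "timeout %ds, attempts %d, rotate %s\n", timeout, attempts, (rotate ? "on" : "off")
             if (rotate) printf "a dead nameserver costs about %ds on the lookups that try it first\n", timeout
             else        printf "a dead first nameserver costs about %ds per uncached lookup\n", timeout
         }' "${1:-/etc/resolv.conf}"
}

# The file from the worked example, written to a scratch path:
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
nameserver 10.40.0.53
nameserver 10.40.0.54
options timeout:3 attempts:2
EOF
resolv_summary "$tmp"
rm -f "$tmp"
```

Called with no argument it reads /etc/resolv.conf on the current box. It does not test whether any resolver is alive; pair it with a dig against each listed address, as in the example above.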

The fast version. When you have two minutes, not twenty: curl -w with the four timers in a small loop (which segment is slow — if starttransfer, stop: it's the server) · dig name twice (cached vs miss, and the Query time) · nc -vz host port (refused vs timeout) · ss -ti dst host (the kernel's own rtt and retrans counters) · mtr --report --report-cycles 100 host (does loss reach the last hop?). Five commands, each seconds long, and afterwards you either have the guilty segment or you have the evidence that the network is innocent.

What to write in the incident notes

Network accusations recur, so the write-up's job is to make the next one shorter. Five things belong in it. First, the verdict in one line, named by segment: "DNS — dead first resolver plus 3 s timeout," not "network issues." The next person grepping old incidents for a p99 spike needs the classification more than the narrative. Second, the evidence chain as raw output: the curl timing lines, the dig query times, the ss -ti counters, the mtr table — pasted, not paraphrased, because numbers can be re-examined when someone doubts the conclusion and prose cannot. Third, the acquittals, stated explicitly: "path clean (retrans 4/34k), server think time flat at 37 ms" — written-down acquittals are what stop the next incident from re-litigating the same suspects. Fourth, what you changed and when, to the minute, so the recovery edge in the graphs has an explanation, and what you deliberately did not change. Fifth, the follow-ups with owners: the resolver restore, the resolv.conf hardening, the resolution-time metric, the provider ticket number if this was real path loss. An investigation that ends with an acquittal, a guilty segment, and a prevention item is one the team only pays for once.
