Is it the network?
The app is slow, or failing, and somebody in the incident channel has already typed the sentence: "looks like a network issue." Sometimes they are right. Usually they are early. The network is the part of the system nobody in the room owns, which makes it the easiest thing to blame and the hardest accusation to retract. This page is the investigation that settles it — split the path into segments, measure each one separately, and acquit them one by one until a single segment is left holding the evidence. Each step is a command, the output you will actually see, and the decision the output forces.
Step 0 — frame the accusation
Start by taking the accusation seriously, because it deserves a fair trial rather than a reflexive dismissal. The network really is guilty sometimes: paths lose packets, resolvers die, firewalls eat SYNs, an MTU mismatch silently swallows every large response while the small ones sail through. But "the network" is not one thing. Between your process and the dependency it calls there are at least five separable segments: name resolution, the TCP handshake, the TLS handshake, the round trips across the path itself, and the time the far server spends thinking before it answers. A slow request is slow in one of those segments far more often than in all of them, and the entire investigation is the act of finding out which.
That framing also sets the standard of evidence. "Pings look fine" acquits nothing — ping tests one segment (the path) with one packet size at one moment, and most of the failures on this page would pass it. "The graph is spiky" convicts nothing. What you want at every step is a measurement that isolates a single segment, so that when you say "it is not DNS" or "it is the server" you can paste the number that proves it. The good news is that the first command does most of the isolating in one move.
Step 1 — curl -w: the four-way split
This is the single best move in the whole investigation, and it is one command. curl's
-w flag prints internal timers after the request completes, and four of them
carve the request into exactly the segments you need to separate. Put the format in a file
once and keep it forever:
    $ cat /tmp/timing.txt
    namelookup     %{time_namelookup}s
    connect        %{time_connect}s
    appconnect     %{time_appconnect}s
    starttransfer  %{time_starttransfer}s
    total          %{time_total}s

    $ curl -sS -o /dev/null -w @/tmp/timing.txt https://payments.internal/v1/health
    namelookup     0.004s   ← DNS answered: resolution took 4 ms
    connect        0.006s   ← TCP handshake done: 6 − 4 = 2 ms to SYN/SYN-ACK/ACK
    appconnect     0.021s   ← TLS done: 21 − 6 = 15 ms of handshake
    starttransfer  1.913s   ← first response byte: 1913 − 21 = 1892 ms of server think time
    total          1.915s   ← body transfer after first byte: 2 ms
Read it as a relay race with cumulative split times. Each number is a timestamp measured
from the start of the request, so the differences between consecutive lines are
the segments: namelookup is DNS; connect minus
namelookup is the TCP handshake, which is one round trip and therefore a clean
proxy for path latency; appconnect minus connect is the TLS
handshake, another one or two round trips plus crypto; and starttransfer minus
appconnect is the gap between the end of the handshake and the first byte of
response: one round trip to carry the request out and the answer back, plus however long
the server sat on the request. On a short path that gap is almost entirely server think
time. It has a name in every CDN dashboard, time to first byte, and it is the number that
settles most accusations on the spot.
In the capture above, the network's entire contribution is 21 milliseconds: fast DNS, a 2 ms handshake that proves the path is short and clean, a routine TLS exchange. Then the request sat inside the server for 1.9 seconds. That is not a network problem in any shape. The packets arrived promptly, the answer was slow to exist. Hand the numbers to the team that owns the dependency and the conversation changes instantly, because you are no longer saying "we think it might be on your side" — you are saying "your service held a complete request for 1.892 seconds before the first byte, here is the timing."
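The subtraction is easy to get wrong under pressure, so here is a minimal sketch that does it for you. The function name split_timings is an assumption of this example, not part of curl; it takes the five cumulative timers in seconds and prints each segment in milliseconds. It assumes an HTTPS URL, because curl reports appconnect as 0 for plain HTTP and the TLS line would then be meaningless.

```shell
# split_timings (hypothetical helper): convert curl's cumulative -w timers
# (seconds) into per-segment durations (milliseconds).
# Arguments, in order: namelookup connect appconnect starttransfer total
split_timings() {
  awk -v dns="$1" -v tcp="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    printf "dns %.0f ms\n",      dns * 1000            # name resolution
    printf "tcp %.0f ms\n",      (tcp - dns) * 1000    # TCP handshake, about one RTT
    printf "tls %.0f ms\n",      (tls - tcp) * 1000    # TLS handshake
    printf "server %.0f ms\n",   (ttfb - tls) * 1000   # request sent until first byte
    printf "transfer %.0f ms\n", (total - ttfb) * 1000 # body after first byte
  }'
}

# The capture from this step, fed in by hand:
split_timings 0.004 0.006 0.021 1.913 1.915
```

Against the numbers above this prints dns 4, tcp 2, tls 15, server 1892, transfer 2. To feed it live, something like `split_timings $(curl -sS -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' URL)` should work, with URL standing in for your dependency.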
One run is an anecdote, so when the symptom is intermittent, loop it and print one line per attempt:
    $ for i in $(seq 1 8); do curl -sS -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_starttransfer}\n' https://payments.internal/v1/health; done
    0.003 0.005 0.041
    0.004 0.006 0.043
    3.004 3.006 3.044   ← the slow ones: all three numbers shifted by the same 3 s
    0.003 0.005 0.040
    0.004 0.007 0.044
    3.005 3.008 3.051   ← …because the 3 s happened before connect even started: it's DNS
    0.003 0.005 0.042
    0.004 0.006 0.039
The arithmetic does the diagnosis. On the slow runs, connect − namelookup is
still 2 to 3 ms and starttransfer − connect is still about 40 ms, in line with the
fast runs. The entire extra three seconds lives inside namelookup. The TCP
path is innocent, TLS is innocent, the server is innocent; the suspect is resolution. And
notice the shape of the number: 3.004, 3.005. Real congestion produces a smear of values;
a timeout constant produces the same round number every time. Round numbers are
confessions, and we will extract this one fully in the worked example below.
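If you would rather not do that subtraction by eye on every line, a small filter can do it. This is a sketch under assumptions: blame_segment is a made-up name, it expects exactly the three-column output of the loop above, and "rest" lumps TLS and server time together because that loop does not print appconnect.

```shell
# blame_segment (hypothetical helper): read "namelookup connect starttransfer"
# triples in seconds, one run per line, and print per-segment milliseconds
# plus the segment that took longest on that run.
blame_segment() {
  awk 'NF == 3 {
    dns  = $1 * 1000           # resolution
    tcp  = ($2 - $1) * 1000    # TCP handshake
    rest = ($3 - $2) * 1000    # TLS + request + server think time
    worst = "dns"; max = dns
    if (tcp  > max) { worst = "tcp";  max = tcp }
    if (rest > max) { worst = "rest"; max = rest }
    printf "dns=%.0f tcp=%.0f rest=%.0f slowest=%s\n", dns, tcp, rest, worst
  }'
}

# One fast run and one slow run from the loop above:
printf '0.003 0.005 0.041\n3.004 3.006 3.044\n' | blame_segment
```

On a healthy run "rest" is normally the largest segment, so slowest=rest is not an alarm by itself. The useful signal is a run where the label flips to dns or tcp, or where one column jumps while the others hold still.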
starttransfer − appconnect dominates → the server is slow, not the network; hand off with the timing and stop.
namelookup big or spiky → step 2, DNS.
connect − namelookup big, or the connection fails outright → step 3, TCP.
Everything clean on this box but the symptom is intermittent or only hits some hosts → step 4, the path.
Handshakes clean but large transfers stall → step 5.
Step 2 — when namelookup is big: ask DNS directly
curl told you resolution is slow; dig tells you which part of resolution.
The question to separate is cache versus upstream: is the configured resolver answering
slowly from its own cache (resolver sick), or answering instantly when it has the record
and slowly when it has to go ask (upstream or authoritative sick), or not answering at all
(dead, with a failover timeout doing the damage)? Two runs back to back start the split:
    $ dig payments.internal
    ;; ANSWER SECTION:
    payments.internal.   17   IN   A   10.40.12.7   ← TTL 17 s left: this answer is cached, expires soon
    ;; Query time: 3001 msec                        ← three full seconds for a CACHED answer: the resolver path is sick
    ;; SERVER: 10.40.0.53#53(10.40.0.53)

    $ dig payments.internal
    ;; Query time: 0 msec                           ← same query, instant: so it is not "DNS is slow", it is "DNS is sometimes slow"
    ;; SERVER: 10.40.0.53#53(10.40.0.53)              suspect one resolver in the list, or cache misses
The Query time line is the whole tool. A consistently slow cached answer means
the resolver itself, or the path to it, is struggling. Instant cached answers with slow
misses means the resolver is fine and its upstream is not — test that by querying the
authoritative server directly with dig @ns1.example.net payments.internal and
comparing. An answer that is sometimes instant and sometimes a round three seconds means
the client is timing out on one resolver and failing over to another, which is a
configuration problem wearing a latency costume: check /etc/resolv.conf for a
dead first nameserver, because the libc stub resolver tries the list in order
and pays the full timeout before moving on. Also glance at the TTL while you are here — a
record with a 10-second TTL guarantees constant cache misses, so every bit of upstream
slowness gets amplified into request latency on a schedule.
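Since the Query time line carries the verdict, it is worth pulling out mechanically when you run the pair more than once. A rough sketch follows; the function names and the 100 ms "slow" threshold are assumptions chosen for illustration, not anything dig defines.

```shell
# query_time_ms (hypothetical helper): read dig output on stdin and print the
# number from the ";; Query time: N msec" line. Prints nothing if dig got no
# answer at all.
query_time_ms() {
  awk '/^;; Query time:/ { print $4 }'
}

# classify_pair FIRST_MS SECOND_MS: compare two back-to-back query times.
# The 100 ms cut-off is an arbitrary illustration; tune it to your resolver.
classify_pair() {
  slow=100
  if [ "$1" -ge "$slow" ] && [ "$2" -ge "$slow" ]; then
    echo "slow both times: resolver or the path to it"
  elif [ "$1" -ge "$slow" ] || [ "$2" -ge "$slow" ]; then
    echo "bimodal: cache misses or a dead resolver in the list"
  else
    echo "fast both times: resolution looks healthy"
  fi
}

# The two runs shown above:
classify_pair 3001 0
```

In practice you would capture each value with `dig payments.internal | query_time_ms`. If dig times out entirely the helper prints nothing, and an empty value is its own answer: that resolver is not responding.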
The deeper mechanics — what the flags mean, tracing delegation with +trace,
why dig bypasses nsswitch and your app might not — live in
dig, and the system it interrogates is the
subject of DNS.
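Slow even when the answer is cached → the resolver, or the path to it, is sick.
Fast when cached, slow on misses → the upstream is sick; confirm with dig @authoritative.
Bimodal with a round-number slow mode → a dead resolver in resolv.conf and a timeout constant; fix the list.
Any of these is an ending: DNS was the culprit, the rest of the network is acquitted.
Step 3 — when connect is big or failing: the TCP segment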
Now suppose the split blamed the middle: connect − namelookup is fat, or curl
never connects at all. The first distinction to extract is refused versus timeout,
because they are opposite verdicts, and nc gets it in one line:
    $ nc -vz 10.40.12.7 443
    nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection refused
      ↑ instant failure = a RST came back. The network DELIVERED that rejection: path works,
        host is up, nothing is listening on 443. Wrong port, crashed service, or wrong box.

    $ nc -vz 10.40.12.7 443
    (hangs…)
    nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection timed out
      ↑ the SYN vanished. Nothing answered, nothing rejected: firewall rule, security group,
        routing blackhole, or a host that is simply gone. THIS one is a network/filter problem.
"Refused" is, counterintuitively, good news for the network: a refusal is a packet, and
packets that arrive prove the path. The investigation moves to the far host — is the
process running, is it bound to the right interface and port. "Timed out" means something
between you and the listener swallowed the SYN without comment, and that is where firewalls,
cloud security groups, and routing mistakes live. Before escalating, check the boring local
candidates: ip route get 10.40.12.7 shows which route and interface the kernel
would actually use (wrong VPN routes and stale routing tables are found here more often
than anyone admits), and ip -s link shows interface-level errors and drops.
Both are covered in ip.
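When you are checking many host and port pairs, it helps to turn nc's wording into the verdict directly. The sketch below assumes the message formats shown above; connect_verdict is a made-up name, and the success patterns are a guess that covers common nc variants, whose wording differs.

```shell
# connect_verdict (hypothetical helper): read nc's diagnostic output on stdin
# and translate the first recognisable line into a verdict.
connect_verdict() {
  while IFS= read -r line; do
    case "$line" in
      *"Connection refused"*)
        echo "refused: path works, nothing listening"; return 0 ;;
      *"timed out"*)
        echo "timeout: SYN dropped, suspect filter or routing"; return 0 ;;
      *succeeded*|*open*)   # wording varies by nc implementation (assumption)
        echo "open: handshake completed"; return 0 ;;
    esac
  done
  echo "unknown: read the raw output"
}

# The refusal shown above:
echo 'nc: connect to 10.40.12.7 port 443 (tcp) failed: Connection refused' | connect_verdict
```

nc writes these messages to stderr, so a live run needs the redirect: `nc -vz -w 5 10.40.12.7 443 2>&1 | connect_verdict`. The `-w 5` caps the wait so a dropped SYN does not hang the loop for the full kernel timeout.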
If connections establish but feel slow, stop guessing and ask the kernel, because it has
been measuring every connection all along. ss -ti prints the TCP stack's own
statistics for a live connection:
    $ ss -ti dst 10.40.12.7
    ESTAB  0  0  10.40.3.21:51844  10.40.12.7:443
         cubic wscale:7,7 rto:204 rtt:2.4/0.6 mss:1448 pmtu:1500 cwnd:36 bytes_acked:48211324
         segs_out:34122 segs_in:30876 retrans:0/4 dsacks_dup:1 rcv_space:67392 minrtt:1.9

      rtt:2.4/0.6  → smoothed round-trip 2.4 ms, mean deviation 0.6 ms; measured on real traffic, not pings
      retrans:0/4  → 0 segments in flight as retransmissions now, 4 total in the connection's life.
                     4 out of 34,122 sent is a clean path. A guilty path looks like retrans:0/3811.
      minrtt:1.9   → the best the path has ever done; rtt far above minrtt = queueing, not distance
This is the cheapest high-quality evidence in the whole investigation. The kernel timestamps
every acknowledged segment, so rtt here is the real round-trip your traffic
experiences, and retrans is an exact count of how many times TCP had to resend.
A connection showing rtt:2.4 and retrans:0/4 acquits the path
outright, whatever the dashboards say. One showing rtt:48/22 against a
minrtt:1.9, or thousands of total retransmissions, convicts it — and gives you
the number to escalate with. The full field-by-field tour is in
ss, and what the retransmission and congestion
machinery is actually doing under those counters is in
TCP.
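To paste a single number into the incident channel instead of a wall of counters, the retransmitted share of segments is a reasonable summary. The following is a sketch, not a full parser: ss_health is a hypothetical name, it assumes the key:value layout shown above, and if several connections are piped in it reports only the last one.

```shell
# ss_health (hypothetical helper): read `ss -ti` output for one connection on
# stdin; print rtt, minrtt, lifetime retransmissions, and the retransmitted
# share of segments sent.
ss_health() {
  awk '{
    for (i = 1; i <= NF; i++) {
      n = split($i, kv, ":")
      if (n != 2) continue                 # skip tokens that are not key:value
      if (kv[1] == "rtt")      { split(kv[2], r, "/"); rtt = r[1] }
      if (kv[1] == "minrtt")   { minrtt = kv[2] }
      if (kv[1] == "segs_out") { segs = kv[2] }
      if (kv[1] == "retrans")  { split(kv[2], t, "/"); retrans = t[2] }
    }
  }
  END {
    if (segs == "") { print "no TCP info found"; exit }
    # ss omits retrans: entirely when nothing was ever resent; %d prints 0 then
    printf "rtt=%s minrtt=%s retrans=%d segs_out=%d retrans_pct=%.3f\n",
           rtt, minrtt, retrans, segs, 100 * retrans / segs
  }'
}

# The connection shown above:
printf '%s\n' 'ESTAB 0 0 10.40.3.21:51844 10.40.12.7:443' \
  'cubic wscale:7,7 rto:204 rtt:2.4/0.6 mss:1448 pmtu:1500 cwnd:36 segs_out:34122 segs_in:30876 retrans:0/4 minrtt:1.9' \
  | ss_health
```

For the connection above this reports retrans_pct=0.012, which is the "4 out of 34,122" acquittal as a percentage. A live run would be `ss -ti dst 10.40.12.7 | ss_health`.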
Refused → the path works; go look at the listener on the far host.
Timed out → check ip route get locally, then escalate to whoever owns the firewall.
Established but slow with high rtt/retrans in ss -ti → real path trouble; go to step 4 to find where.
Clean counters → the path is fine from this box; reconsider the accusation.
Step 4 — when it's intermittent: walk the path with mtr
Intermittent symptoms are where investigations go to die, because every one-shot tool keeps
catching the good moments. mtr is traceroute that refuses to stop: it probes
every hop on the path continuously and keeps per-hop loss and latency statistics, so a
problem that shows up 4% of the time shows up as a 4% loss figure instead of as an argument.
Run it in report mode with enough cycles to make the percentages mean something:
    $ mtr --report --report-cycles 100 payments-gw.example.net
    HOST: checkout-04                Loss%   Snt   Last    Avg   Best   Wrst  StDev
      1.|-- 10.40.3.1                 0.0%   100    0.4    0.4    0.3    1.2    0.1
      2.|-- 10.40.0.9                 0.0%   100    0.7    0.8    0.6    2.4    0.2
      3.|-- core-rtr-2.example.net   40.0%   100    1.1    1.2    0.9    3.0    0.3
      4.|-- edge-fw-1.example.net     0.0%   100    1.6    1.8    1.4   11.2    1.1
      5.|-- payments-gw.example.net   0.0%   100    2.1    2.3    1.8    9.7    0.9

      40% "loss" at hop 3, but 0% at every hop AFTER it: hop 3 is rate-limiting its own
      ICMP replies while forwarding your traffic perfectly. Not loss. Read the LAST hop.
That hop-3 number is the classic misread, and it has cost people their credibility in incident channels. Routers answer traceroute probes with their control plane, which is deliberately rate-limited and deprioritised: forwarding your packets is the job, answering your probes is a courtesy. So an intermediate hop showing loss that the hops after it do not inherit is a router declining the courtesy, nothing more. Real loss propagates: if hop 4 starts dropping your packets, then hops 5, 6, and the destination all show that loss too, because the probes to them pass through hop 4. The rule is mechanical. Loss begins at hop N and continues to the destination: real, and it started near hop N. Loss appears at one hop and vanishes afterward: cosmetic, ignore it.
    $ mtr --report --report-cycles 100 payments-gw.example.net        (a genuinely bad path)
    HOST: checkout-04                Loss%   Snt   Last    Avg   Best   Wrst  StDev
      1.|-- 10.40.3.1                 0.0%   100    0.4    0.4    0.3    1.1    0.1
      2.|-- 10.40.0.9                 0.0%   100    0.7    0.8    0.6    2.2    0.2
      3.|-- core-rtr-2.example.net    0.0%   100    1.1    1.2    0.9    2.8    0.3
      4.|-- transit-a.upstream.net    6.0%   100   24.3   31.8    1.6  212.4   44.0
      5.|-- payments-gw.example.net   7.0%   100   25.1   33.2    2.1  208.9   45.2

      loss starts at hop 4 and the DESTINATION inherits it; Avg jumps 1.2 → 31.8 ms and StDev
      explodes at the same hop. The path degrades at the handoff to transit-a. That is the report.
Notice everything arriving at once in the second report: loss begins at hop 4, persists to the end, average latency jumps more than twenty-fold at the same hop (1.2 ms to 31.8 ms), and the standard deviation explodes. That is a queueing, congested, or failing link, located. The report is also exactly the artifact your provider's support queue needs; "users say it's slow" gets triaged, an mtr report with loss starting at their hop gets fixed. Usage details, the UDP/TCP probe modes for paths that filter ICMP, and the nc pairing live in nc & mtr.
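The propagation rule is mechanical enough to script. Here is a minimal sketch, assuming the report layout shown in this step; mtr_verdict is a hypothetical name. It looks at the destination first and only then walks backwards to find where the loss begins. It is a simplification: a cosmetic rate-limiting hop sitting directly before the real fault would pull the reported start one hop early.

```shell
# mtr_verdict (hypothetical helper): read an `mtr --report` table on stdin and
# apply the rule "loss is real only if the destination inherits it".
mtr_verdict() {
  awk '$1 ~ /^[0-9]+\.\|--$/ {
    n++
    host[n] = $2
    loss[n] = $3 + 0                 # "6.0%" becomes 6
  }
  END {
    if (n == 0) { print "no hops parsed"; exit }
    if (loss[n] == 0) { print "destination loss 0%: mid-path loss is cosmetic"; exit }
    start = n
    # walk back from the destination while the loss is still present
    for (i = n - 1; i >= 1 && loss[i] > 0; i--) start = i
    printf "real loss: %.1f%% at destination, begins at hop %d (%s)\n",
           loss[n], start, host[start]
  }'
}

# The bad path from this step, trimmed to the hops that matter:
printf '%s\n' \
  '  3.|-- core-rtr-2.example.net    0.0%   100    1.1    1.2    0.9    2.8    0.3' \
  '  4.|-- transit-a.upstream.net    6.0%   100   24.3   31.8    1.6  212.4   44.0' \
  '  5.|-- payments-gw.example.net   7.0%   100   25.1   33.2    2.1  208.9   45.2' \
  | mtr_verdict
```

A live run is `mtr --report --report-cycles 100 payments-gw.example.net | mtr_verdict`. Treat the output as a first read of the table, then look at the latency columns yourself before filing the ticket.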
Step 5 — the courtroom: tcpdump
When the cheap tools disagree, or when you need evidence that survives cross-examination, capture the packets. You rarely need more than thirty seconds of a filtered capture, and you are looking for three specific signatures, each of which convicts a different party:
    $ sudo tcpdump -ni eth0 host 10.40.12.7 and port 443

    # signature 1: retransmission, the same seq range sent twice, ~200 ms apart
    10:41:22.183442 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
    10:41:22.391077 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448
      ↑ we sent it, no ACK came, the RTO fired and we sent it again. Something between the boxes
        lost either the data or its ACK. This is what real network loss looks like.

    # signature 2: duplicate ACKs, the receiver acking the same byte over and over
    10:41:23.102211 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
    10:41:23.102389 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
    10:41:23.102541 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 91308, win 262, length 0
      ↑ "I'm still waiting for byte 91308" three times = a segment went missing in flight and later
        ones arrived. Loss again, and it pinpoints the direction: toward the receiver.

    # signature 3: zero window, the receiver's buffer is full
    10:41:24.518230 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 184204, win 0, length 0
      ↑ the packet ARRIVED and was acked: the network did its job. win 0 means the receiving
        APPLICATION is not reading from its socket. The network is acquitted; the app is guilty.
The third signature is the one to memorise, because it is the great exoneration. A zero window advertisement means TCP on the far side received your data, acknowledged it, and has nowhere to put more because the application above it has stopped draining the socket buffer. Every packet did everything right. The "network problem" is a consumer that cannot keep up — a thread pool exhausted, a downstream call blocking the read loop, a process deep in GC. Transfers that start fast and then stall in lockstep with the far service's CPU or latency graphs are this, and the fix is in that service, not in any switch. The capture-filter syntax, reading flags fluently, and writing captures to files for Wireshark are covered in tcpdump; the sequence and window machinery these signatures come from is in TCP.
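Thirty seconds of capture is still thousands of lines, so a first-pass counter for the three signatures saves scrolling. This is a heuristic sketch only: tcp_signatures is a made-up name, it assumes tcpdump's default one-line text format as shown above, and its duplicate-ACK test (three zero-length ACKs for the same byte on one flow) will also count harmless window updates. Wireshark's TCP analysis is the tool for a careful verdict.

```shell
# tcp_signatures (hypothetical helper): scan tcpdump text output on stdin and
# count the three signatures from this step.
tcp_signatures() {
  awk '{
    flow = $3 " > " $5
    seq = ""; ack = ""; win = ""; len = ""
    for (i = 6; i < NF; i++) {
      v = $(i + 1); sub(/,$/, "", v)        # value follows its keyword
      if ($i == "seq")    seq = v
      if ($i == "ack")    ack = v
      if ($i == "win")    win = v
      if ($i == "length") len = v
    }
    if (seq ~ /:/) {                        # data segment: seq is a range
      sent[flow " " seq]++
      if (sent[flow " " seq] == 2) retrans++
    }
    if (len == "0" && ack != "") {          # pure ACK
      acks[flow " " ack]++
      if (acks[flow " " ack] == 3) dupack++
    }
    if (win == "0") zerowin++
  }
  END {
    printf "retransmitted_ranges=%d dup_ack_runs=%d zero_windows=%d\n",
           retrans, dupack, zerowin
  }'
}

# One retransmitted range and one zero-window ACK from the excerpts above:
printf '%s\n' \
  '10:41:22.183442 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448' \
  '10:41:22.391077 IP 10.40.3.21.51844 > 10.40.12.7.443: Flags [P.], seq 88412:89860, ack 1, win 501, length 1448' \
  '10:41:24.518230 IP 10.40.12.7.443 > 10.40.3.21.51844: Flags [.], ack 184204, win 0, length 0' \
  | tcp_signatures
```

Run live, it would sit at the end of the capture, for example `sudo tcpdump -ni eth0 -c 5000 host 10.40.12.7 and port 443 | tcp_signatures`. All zeros under load is the clean capture that sends you back to the timing split.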
win 0 from the far side → the receiving application is the bottleneck; network acquitted, hand off to that team.
win 0 from your side → your own app is the slow consumer; the accusation reverses.
A clean capture under load → write down that the wire is innocent and go back to the timing split.
The decision tree, on one screen
The tree has a shape worth noticing: it is mostly exits. The honest outcome of a network investigation is usually an acquittal with a forwarding address — "not the network, here is whose it is, here are the numbers." The steps exist to make that handoff stick. If a step neither convicts a segment nor acquits one, you measured the wrong thing; go back to the timing split and pick the segment the numbers actually point at.
The endings
Nearly every "it's the network" incident resolves to one of five stories, and knowing them ahead of time changes how you read every output above.
Ending 1 — the server was slow all along
The most common ending by a wide margin. starttransfer dominates the split, the
handshakes are crisp, ss -ti shows a clean connection. The dependency is slow
to produce answers, and everything between you and it is innocent. The work here is the
handoff: paste the curl timing, name the gap ("1.9 s between request delivered and first
byte"), and resist the urge to soften it into "might be on your side." Evidence makes
handoffs frictionless; hedging makes them ping-pong. If the dependency's owners then run
this same investigation from their side, against their downstreams, the
incident converges instead of circulating.
Ending 2 — DNS
A dead or sick resolver, a timeout constant doing failover damage, a TTL short enough to
turn every upstream wobble into request latency, or a resolver that is fine while the
authoritative zone is not. The signature is latency quantised into round numbers and
confined entirely to namelookup. Fixes are unglamorous and fast: repair the
resolver list, drop the timeout, run a local caching daemon, lengthen the TTL. The
follow-up that prevents the rerun is monitoring resolution time as its own metric instead
of letting it hide inside request latency.
Ending 3 — real path loss
The genuinely guilty network: retransmissions in ss -ti and the capture, loss
in mtr that begins at a hop and propagates to the destination, latency variance exploding
at the same hop. You usually cannot fix this yourself — the work is escalation done well.
Attach the mtr report from both directions if you can get one back, the
retrans counters, a capture excerpt, and the time window. A provider ticket
with loss pinned to their hop and a packet trace gets engineering attention; one that says
"customers report slowness" gets a script reply. Until it is fixed, route around it if you
have the option, and say so in the incident notes.
Ending 4 — app back-pressure wearing a network costume
Zero-window advertisements: the far application (or yours) stopped draining its socket, TCP closed the window, throughput collapsed, and every dashboard called it a network problem because the symptom appeared between two boxes. The network delivered and acknowledged every byte. This ending matters because it loops back into the other investigations in this series — a consumer that cannot keep up is usually CPU-bound, lock-bound, or GC-bound, and the packet capture has just told you exactly which process to point those tools at.
Ending 5 — the MTU blackhole
The strangest one, and instantly recognisable once you know the telltale: small
requests succeed, large ones hang forever. Health checks pass, logins work, and the
4 KB JSON upload dies every time. Somewhere on the path — typically a VPN, a tunnel, or an
overlay network — the MTU is smaller than the endpoints believe, full-size packets are
dropped, and the ICMP "fragmentation needed" messages that would trigger path-MTU discovery
are being filtered, so the sender never learns. The handshake packets are small and always
survive, which is why connections establish and then freeze. Confirm with
ping -M do -s 1472 host: a don't-fragment ping at full size fails while small
pings succeed, and shrinking the size until it passes measures the real path MTU. Fix by
clamping MSS at the tunnel or lowering the interface MTU with
ip. Any failure that is conditional on payload
size should make you think MTU before anything else.
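Shrinking the size by hand gets tedious, and a binary search finds the answer in about ten probes. The sketch below is an assumption-laden helper rather than a standard tool: find_path_mtu and df_ping are invented names, TARGET is a variable you set yourself, the ping flags are the Linux iputils ones used above, and the 28-byte overhead (20 bytes of IPv4 header plus 8 of ICMP header) is IPv4-only.

```shell
# find_path_mtu PROBE LO HI (hypothetical helper): binary-search the largest
# ICMP payload that PROBE accepts. PROBE is any command that takes a payload
# size and exits 0 when a don't-fragment ping of that size gets through.
find_path_mtu() {
  probe=$1 lo=$2 hi=$3
  "$probe" "$lo" || { echo "even $lo bytes fails: not an MTU problem"; return 1; }
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))
    if "$probe" "$mid"; then lo=$mid; else hi=$((mid - 1)); fi
  done
  # payload + 20-byte IPv4 header + 8-byte ICMP header
  echo "largest payload $lo bytes, path MTU $((lo + 28)) bytes"
}

# Real probe on Linux: one DF-bit packet, 2 s wait. TARGET is set by the caller.
df_ping() { ping -c 1 -W 2 -M do -s "$1" "$TARGET" >/dev/null 2>&1; }

# Stand-in probe for a dry run: pretends the path passes payloads up to 1372.
fake_probe() { [ "$1" -le 1372 ]; }
find_path_mtu fake_probe 500 1472
```

With the stand-in probe the search reports a path MTU of 1400 bytes. For a real path, run `TARGET=10.40.12.7 find_path_mtu df_ping 500 1472`; a result below 1500 on a network that believes it is 1500 is the blackhole, measured.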
A worked example, end to end
Here is the path run once, the way it actually happens. Checkout calls the payments API on every order. The pager fires at 10:38: checkout p99 has spiked from 180 ms to over 3 seconds, intermittently, for twenty minutes. The payments team's own dashboard shows their p99 flat at 40 ms, so the incident channel has already reached its verdict: "must be network between us." SSH into a checkout box and put numbers on it.
    $ for i in $(seq 1 8); do curl -sS -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_starttransfer}\n' https://payments.internal/v1/health; done
    0.003 0.005 0.041
    0.004 0.006 0.042
    3.004 3.006 3.045
    0.003 0.005 0.040
    0.003 0.006 0.043
    0.004 0.005 0.041
    3.006 3.008 3.049
    0.003 0.005 0.042

      → connect − namelookup ≈ 2 ms always; starttransfer − connect ≈ 37 ms always. The path and
        the server are identical on good and bad runs. The 3 s lives entirely in namelookup, and it
        is the same round 3 s every time. A timeout constant. DNS. Step 2.
One loop and the field has narrowed from "the network" to "name resolution on this box." The payments team is already acquitted — their flat 40 ms p99 was telling the truth, since their clock starts when the request arrives, and these requests were losing three seconds before being sent. Now make DNS confess:
    $ cat /etc/resolv.conf
    nameserver 10.40.0.53
    nameserver 10.40.0.54
    options timeout:3 attempts:2                          ← there is the 3-second constant, in writing

    $ dig @10.40.0.53 payments.internal
    ;; communications error to 10.40.0.53#53: timed out   ← first resolver: dead

    $ dig @10.40.0.54 payments.internal
    payments.internal.   30   IN   A   10.40.12.7
    ;; Query time: 2 msec                                 ← second resolver: perfectly healthy
The whole story is now in four lines. The first nameserver in the list died around 10:15
(its host was part of a maintenance batch, it later turns out). The record's TTL is 30
seconds, so each checkout process gets 30 seconds of cached, instant resolution, then a
cache miss — and every miss tries the dead resolver first, waits the configured 3 seconds,
fails over to the healthy one, and succeeds. Hence intermittent: most requests ride the
cache; the unlucky ones pay exactly 3.00 s. Hence invisible to payments: the delay happens
before the request exists on the wire. The mitigation is one line — pull
10.40.0.53 from resolv.conf across the fleet, p99 drops back to
180 ms within a minute. The fixes that outlast the incident: restore the resolver, set
options timeout:1 rotate so a single dead entry costs one second instead of
three, and add a resolution-time metric so the next dead resolver pages someone as a
DNS problem instead of arriving dressed as a checkout latency mystery. Total
investigation, about six minutes, and not one step was a guess: the loop isolated the
segment, the round number named the mechanism, resolv.conf supplied the
constant, and two digs found the body.
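The reasoning in this example, dead entry first in the list times the configured timeout, can be checked against any resolv.conf before an incident rather than during one. A small sketch under stated assumptions: resolv_penalty is a hypothetical name, it models in-order lookup with a single dead server, it deliberately ignores the rotate option, and it falls back to 5 seconds because that is the glibc default when no timeout option is present.

```shell
# resolv_penalty DEAD_IP (hypothetical helper): read a resolv.conf on stdin
# and report what one dead nameserver costs, assuming in-order lookup
# (the rotate option is not modelled here).
resolv_penalty() {
  awk -v dead="$1" '
    $1 == "nameserver" { n++; if ($2 == dead) pos = n }
    $1 == "options" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^timeout:[0-9]+$/) { split($i, t, ":"); timeout = t[2] }
    }
    END {
      if (timeout == "") timeout = 5      # glibc default when unset
      if (!pos)          print "not listed: no penalty"
      else if (pos == 1) printf "listed first: every uncached lookup waits %d s\n", timeout
      else               print "listed but not first: only hit when earlier servers fail"
    }'
}

# The file from this example, with the first resolver dead:
printf 'nameserver 10.40.0.53\nnameserver 10.40.0.54\noptions timeout:3 attempts:2\n' \
  | resolv_penalty 10.40.0.53
```

Against the file above this reports a 3-second wait on every uncached lookup, which is the number the curl loop found from the outside. Running `resolv_penalty 10.40.0.53 < /etc/resolv.conf` across the fleet is a cheap audit for the same trap.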
curl -w with the four timers in a small loop (which segment is slow; if starttransfer, stop: it's the server) ·
dig name twice (cached vs miss, and the Query time) ·
nc -vz host port (refused vs timeout) ·
ss -ti dst host (the kernel's own rtt and retrans counters) ·
mtr --report --report-cycles 100 host (does loss reach the last hop?).
Five commands, each seconds long, and afterwards you either have the guilty segment or you have the evidence that the network is innocent.
What to write in the incident notes
Network accusations recur, so the write-up's job is to make the next one shorter. Five
things belong in it. First, the verdict in one line, named by segment: "DNS — dead first
resolver plus 3 s timeout," not "network issues." The next person grepping old incidents
for a p99 spike needs the classification more than the narrative. Second, the evidence
chain as raw output: the curl timing lines, the dig query times, the ss -ti
counters, the mtr table — pasted, not paraphrased, because numbers can be re-examined when
someone doubts the conclusion and prose cannot. Third, the acquittals, stated explicitly:
"path clean (retrans 4/34k), server think time flat at 37 ms" — written-down acquittals are
what stop the next incident from re-litigating the same suspects. Fourth, what you changed
and when, to the minute, so the recovery edge in the graphs has an explanation, and what
you deliberately did not change. Fifth, the follow-ups with owners: the resolver restore,
the resolv.conf hardening, the resolution-time metric, the provider ticket
number if this was real path loss. An investigation that ends with an acquittal, a guilty
segment, and a prevention item is one the team only pays for once.
Further reading
- everything curl — the --write-out option — the full catalogue of -w variables beyond the four this page uses, from the author of curl.
- Cloudflare — Path MTU discovery in practice — why ICMP filtering breaks PMTUD and large packets blackhole; the long version of ending 5.
- Richard Steenbergen — A practical guide to (correctly) troubleshooting with traceroute (NANOG 47) — the canonical treatment of ICMP deprioritisation and every other way to misread a path trace.
- Semicolony — TCP — the retransmission, ACK, and window machinery behind every signature in step 5.