connection reset by peer

Something on the path sent a TCP RST instead of a polite close — usually an idle-timeout mismatch in a connection pool, a crashed peer, or a middlebox that forgot your connection. The investigation is finding who sent it.

The symptom

Reads or writes on an established connection fail with ECONNRESET. The wording varies by stack, but the packet underneath is the same: a TCP segment with the RST flag arrived, and the kernel tore the connection down.

read tcp 10.0.3.41:48732->10.0.7.12:5432: read: connection reset by peer   (Go)
Error: read ECONNRESET                                                    (Node.js)
ConnectionResetError: [Errno 104] Connection reset by peer                (Python)
recv() failed (104: Connection reset by peer) while reading
    response header from upstream                                         (nginx)

The diagnosis

1 Establish the timing pattern first

# three questions, answerable from your own logs/metrics:
#   1. does it hit on the FIRST read/write after a connection sat idle?
#   2. does it hit mid-transfer, partway through a response?
#   3. does it cluster around deploys, restarts, or traffic spikes?
  ← idle-then-reset   → pool reusing a connection the other side reaped (cause 1)
  ← mid-transfer      → peer died or a middlebox lost state (causes 2, 4)
  ← at deploys        → servers closing non-gracefully (cause 2)

The pattern narrows it before any packet capture. The single most common shape in service-to-service traffic is the first one: a client pool hands out a connection that idled past the server’s or load balancer’s keep-alive timeout, and the first use of the corpse gets the RST.
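The three questions above can be turned into a rough triage rule over your own reset logs. This is a minimal sketch, not a definitive classifier: the `ResetEvent` fields and the 30-second idle threshold are hypothetical, and the returned numbers simply refer to the ranked causes further down this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResetEvent:
    """One ECONNRESET occurrence, reconstructed from logs (hypothetical fields)."""
    idle_before_s: float      # how long the connection sat idle before the failing call
    bytes_into_transfer: int  # bytes already exchanged in the failing request/response
    near_deploy: bool         # True if a deploy or restart happened around this time


def likely_causes(event: ResetEvent, idle_threshold_s: float = 30.0) -> List[int]:
    """Map the timing pattern to candidate cause numbers from this page."""
    if event.bytes_into_transfer == 0 and event.idle_before_s >= idle_threshold_s:
        return [1]      # idle-then-reset: pool reused a reaped connection
    if event.near_deploy:
        return [2]      # at deploys: server closed non-gracefully
    if event.bytes_into_transfer > 0:
        return [2, 4]   # mid-transfer: peer died or a middlebox lost state
    return []           # no clear pattern; go capture packets
```

Running this over a day of reset events and counting the results per cause is usually enough to tell which step of the diagnosis deserves attention first.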

2 Catch the RST on the wire

$ sudo tcpdump -n -v -i any "tcp[tcpflags] & (tcp-rst) != 0 and host 10.0.7.12"
11:42:09.318 IP 10.0.7.12.5432 > 10.0.3.41.48732: Flags [R.], seq 1, ack 4821, ttl 62
  ← the RST claims to come from the peer, but check its ttl (tcpdump prints it
    only with -v) against the peer’s normal packets, which should show one
    steady value (here 62): a different ttl on the RST means a middlebox
    forged it in the peer’s name

The source address names the sender, with one caveat: firewalls and other middleboxes inject RSTs using the peer’s address. The tell is the IP TTL: a forged RST starts its journey at a different point on the path, often with a different initial TTL, so the value you see usually differs from the peer’s genuine traffic. Same TTL as normal traffic: believe the peer sent it (causes 2–3). Different TTL: a box in the middle is killing your connections (cause 4).
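The TTL comparison is easy to automate once you have TTL values pulled out of a capture. The sketch below assumes you have already extracted them as integers; the function name and the `tolerance` parameter are illustrative, not part of any tool.

```python
from typing import Sequence


def rst_looks_forged(rst_ttl: int, peer_ttls: Sequence[int], tolerance: int = 0) -> bool:
    """Return True when the RST's TTL falls outside the range seen on the
    peer's genuine packets, widened by an optional tolerance."""
    if not peer_ttls:
        raise ValueError("need at least one TTL sample from the peer's normal traffic")
    low = min(peer_ttls) - tolerance
    high = max(peer_ttls) + tolerance
    return rst_ttl < low or rst_ttl > high
```

A result of True is a hint, not proof: routing changes can shift the peer's TTL too, so sample the genuine traffic close in time to the RST.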

3 Line up the idle timeouts end to end

# every hop has one; the defaults rarely agree:
client pool   idleTimeout            e.g. 90s   ← must be the SMALLEST
load balancer idle timeout            e.g. 60s   (cloud LBs: commonly 60-350s)
server        keepalive_timeout       e.g. 75s   (nginx default)
  ← here the client happily reuses connections the LB killed at 60s:
    every reuse in the 60-90s window is an ECONNRESET waiting to happen

The rule: the client must abandon idle connections before anything downstream does. Audit the client pool’s idle timeout against the minimum across every hop between it and the server. Fixing this one inequality removes the steady background drip of resets most fleets have learned to ignore.
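The inequality can be checked mechanically. This is a minimal sketch that assumes you have collected each hop's idle timeout by hand; the hop names and the 5-second safety margin are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def audit_idle_timeouts(
    client_idle_s: float,
    downstream_s: Dict[str, float],
    margin_s: float = 5.0,
) -> List[Tuple[str, float]]:
    """Return every downstream hop whose idle timeout the client does not
    undercut by at least margin_s. An empty list means the chain is safe."""
    return [
        (hop, timeout)
        for hop, timeout in downstream_s.items()
        if client_idle_s + margin_s > timeout
    ]


# The example values from the table above: the client at 90s violates both hops.
violations = audit_idle_timeouts(90.0, {"load balancer": 60.0, "server": 75.0})
```

With the client lowered to 50s the same call returns an empty list, because 50s plus the margin stays below the 60s load balancer timeout.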

4 Check for resource-pressure resets on the server side

$ ss -s
TCP:   14310 (estab 9120, closed 4870, orphaned 122, timewait 4731)
$ nstat -az TcpExtListenOverflows TcpExtListenDrops
TcpExtListenOverflows   18804   ← accept queue overflowing: server can’t keep up
$ dmesg -T | grep conntrack
[11:40:02] nf_conntrack: table full, dropping packet
  ← either of these turns "random resets" into a capacity problem with a name

Listen-queue overflows mean connections die at the server’s front door under load. A full conntrack table on any NAT/firewall hop means established connections get dropped mid-life and subsequent packets answered with RSTs. Both are fleet-level findings: fix capacity or table sizes, not the client code.
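Because `nstat -az` reports absolute counters, a large number may be old news; what matters is whether it grows while resets are happening. The sketch below compares two captured outputs, assuming the usual layout of a `#kernel` header line followed by name, value, and rate columns. The function names are illustrative.

```python
from typing import Dict

WATCHED = ("TcpExtListenOverflows", "TcpExtListenDrops")


def parse_nstat(text: str) -> Dict[str, int]:
    """Parse nstat output into {counter_name: value}, skipping header lines."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            continue  # not a counter line
    return counters


def growth(before: str, after: str) -> Dict[str, int]:
    """Return how much each watched counter grew between two samples."""
    first, second = parse_nstat(before), parse_nstat(after)
    return {name: second.get(name, 0) - first.get(name, 0) for name in WATCHED}
```

Take one sample, wait through a period in which clients report resets, then take another; non-zero growth points at the accept queue rather than at client code.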

The causes, ranked

1 Idle keep-alive reaped downstream; client reuses the dead connection

Confirm: Resets occur on the first operation after an idle gap, and the gap consistently exceeds some hop’s idle timeout. Reproducible: hold a connection idle past the suspect timeout, then use it.

2 The peer process crashed or restarted with connections open

Confirm: Reset timestamps correlate with the peer’s restarts, OOM kills, or deploys in your logs.

3 Close with unread data — a protocol-level RST

Confirm: tcpdump shows the server responding early (an error status) and closing while the client is still sending the request body; the RST follows immediately. TCP stacks send an RST when a socket is closed with received data still unread; RFC 9293 says an implementation SHOULD do so to signal that data was lost.

4 A middlebox dropped its state table entry and resets the survivors

Confirm: RST TTL differs from the peer’s real packets; long-lived but quiet connections die after a suspiciously fixed interval.

The fixes

Idle keep-alive reaped downstream; client reuses the dead connection

Set the client pool’s idle timeout below the minimum of server and LB timeouts, enable connection validation or TCP keepalives on pooled connections, and retry idempotent requests once when a reused connection resets — mature HTTP clients do this automatically.
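For clients that do not retry on their own, the retry-once rule can be sketched as a thin wrapper. The `attempt` callable is a hypothetical hook: it receives `True` when it must open a fresh connection instead of taking one from the pool. The set of idempotent methods follows standard HTTP semantics.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})


def call_with_reset_retry(method: str, attempt: Callable[[bool], T]) -> T:
    """Run attempt(False) on a possibly reused connection. If it is reset and
    the method is idempotent, run attempt(True) exactly once on a fresh one."""
    try:
        return attempt(False)
    except ConnectionResetError:
        if method.upper() not in IDEMPOTENT_METHODS:
            raise  # never blindly replay non-idempotent requests
        return attempt(True)
```

The second attempt is not wrapped in another retry on purpose: a reset on a fresh connection is a different problem and should surface to the caller.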

The peer process crashed or restarted with connections open

Make shutdown graceful on the server: on SIGTERM, stop accepting new connections, drain in-flight requests, then close and exit. The kill & signals page covers the sequencing; the orchestrator’s grace period must exceed your drain time.
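The drain sequence can be sketched with a small in-flight counter. This is an illustration under stated assumptions, not a complete server: `DrainGate` and `install_sigterm_handler` are hypothetical names, and your request handlers would need to call `try_enter` and `leave` around each request.

```python
import signal
import threading


class DrainGate:
    """Tracks in-flight requests and refuses new ones once shutdown begins."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._closing = False
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight yet

    def try_enter(self) -> bool:
        """Call before handling a request; False means reject, we are draining."""
        with self._lock:
            if self._closing:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def leave(self) -> None:
        """Call when a request has finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def begin_shutdown(self, signum=None, frame=None) -> None:
        # Plain assignment, no lock: safe to run inside a signal handler.
        self._closing = True

    def wait_drained(self, timeout_s: float) -> bool:
        """Block until in-flight work reaches zero; False if the timeout hit first."""
        return self._idle.wait(timeout_s)


def install_sigterm_handler(gate: DrainGate) -> None:
    """Must be called from the main thread."""
    signal.signal(signal.SIGTERM, gate.begin_shutdown)
```

The timeout passed to `wait_drained` should be shorter than the orchestrator's grace period, so the process exits on its own terms before it is killed.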

Close with unread data — a protocol-level RST

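Whichever side closes first must read what the other side has already sent. Servers that reject early should drain the request body before closing, and clients should read responses to the end even on error paths. For large uploads, send Expect: 100-continue so the server can reject before the body flows.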
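The mechanism behind this cause can be reproduced on loopback. The sketch below has one side send bytes and the other close without reading them, then reports what the sender observes. On Linux the sender typically sees a reset; other platforms may behave differently, so the function returns the outcome instead of assuming it.

```python
import select
import socket


def close_with_unread_data() -> str:
    """Return 'reset', 'eof', 'data', or 'timeout' depending on what the
    sender sees after the receiver closes with unread bytes queued."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=2.0)
    server_side, _ = listener.accept()
    try:
        client.sendall(b"request body the server never reads")
        # Wait until the bytes are queued on the server side, then close
        # without reading them.
        select.select([server_side], [], [], 2.0)
        server_side.close()
        try:
            data = client.recv(1)
        except ConnectionResetError:
            return "reset"
        except socket.timeout:
            return "timeout"
        return "eof" if data == b"" else "data"
    finally:
        client.close()
        server_side.close()
        listener.close()
```

Adding a `server_side.recv(...)` loop before the close is the experiment worth running next: once the queued bytes are read, the close becomes an ordinary FIN.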

A middlebox dropped its state table entry and resets the survivors

Keep state alive: TCP keepalives (or app-level pings) at an interval below the middlebox’s conntrack/idle timeout, or raise that timeout where you control the box. For NAT-heavy paths, prefer shorter-lived connections over heroic keepalive tuning.
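Enabling TCP keepalives on a client socket looks roughly like this. The interval values are placeholders to be set below your middlebox's idle timeout, and the per-socket tuning options are Linux names that are not defined on every platform, hence the guard.

```python
import socket


def enable_keepalive(
    sock: socket.socket,
    idle_s: int = 30,
    interval_s: int = 10,
    probes: int = 3,
) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, tune when
    probing starts, how often probes repeat, and how many may go unanswered."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:  # skip options this platform does not define
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these placeholder values a middlebox sees traffic on the connection at least every 30 seconds, which keeps its state entry fresh as long as its timeout is longer than that.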

What people get wrong

A reset on a reused pooled connection is a race, not an outage. Even with perfect timeout ordering, the server may close at the exact moment the client transmits. Every robust HTTP client treats ECONNRESET on a reused connection as retryable-once for idempotent requests. The real bug is blind retries of non-idempotent requests — that’s how duplicate orders happen.
Reset and refused are different bugs. "Connection refused" is an RST answering your SYN: nothing is listening, whether from a wrong port, a crashed server, or a firewall reject. "Connection reset by peer" happens on an established connection. Diagnosing one with the other’s playbook wastes the afternoon.
The RST’s source address can lie. Middleboxes forge resets in the peer’s name; the TTL comparison in step 2 is the cheap forensic. If you skip it, you’ll debug the peer’s code for a firewall’s behaviour.
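In code, the reset/refused distinction is worth making explicit so the two errors are logged and alerted on separately. A minimal sketch, with a hypothetical function name, using the exception classes and errno values that Python's standard library defines:

```python
import errno


def classify_connection_error(exc: BaseException) -> str:
    """Return 'refused', 'reset', or 'other' for a caught exception."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"  # RST in answer to the SYN: nothing is listening
    if isinstance(exc, ConnectionResetError):
        return "reset"    # RST on an established connection
    if isinstance(exc, OSError):
        if exc.errno == errno.ECONNREFUSED:
            return "refused"
        if exc.errno == errno.ECONNRESET:
            return "reset"
    return "other"
```

Tagging metrics with this label keeps a deploy that briefly refuses connections from being filed under the same heading as idle-timeout resets.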

Quick answers

Is "connection reset by peer" a client-side or server-side problem?

The error appears on whichever side was using the connection when the RST arrived, but the cause is usually a relationship problem: the client’s idle timeout is longer than the server’s or load balancer’s, so the client reuses connections the other side already closed. Audit the timeout chain before blaming either codebase.

How do I find out what actually sent the RST?

Capture it: run tcpdump with a filter on the RST flag, adding -v so the TTL is printed. The source address names the sender, and comparing the RST’s IP TTL with the peer’s normal packets exposes middlebox-forged resets: a different TTL means a firewall or NAT device built that packet, not the peer.

Why does it only happen after the connection sits idle?

Because some hop reaped the idle connection — server keep-alive timeout, LB idle timeout, or a NAT table expiry — and the client pool didn’t know. The first read or write on the dead connection gets the reset. Fix: client idle timeout strictly below every downstream timeout, plus one retry for idempotent requests.
