context deadline exceeded

A context’s timer ran out before the operation finished. The message tells you a budget was blown, not where — the work is finding which hop spent the time, and whether the server ever even saw the request.


The symptom

A Go call returns context.DeadlineExceeded, or a gRPC call fails with code DeadlineExceeded. The same expiry also shows up in disguise downstream — a database cancelling your statement is often your own deadline arriving there first.

Get "http://inventory.svc:8080/v1/stock": context deadline exceeded
rpc error: code = DeadlineExceeded desc = context deadline exceeded
pq: canceling statement due to user request
  ← that last one is Postgres’s view of YOUR context expiring: the driver
    cancelled the statement when ctx fired

The diagnosis

1 Find whose deadline actually fired

// contexts inherit: the EARLIEST deadline in the chain wins.
// log the remaining budget where work starts:
if d, ok := ctx.Deadline(); ok {
    log.Printf("budget remaining: %s", time.Until(d))
}
// budget remaining: 38ms     ← caller handed you almost nothing:
//                              the time was spent upstream before you ran

Your own WithTimeout(2s) is meaningless under a parent that expires in 100ms. Walk the chain: every context.WithTimeout / WithDeadline between the entrypoint and the failing call, including middleware and client defaults. Logging time-until-deadline at each hop turns "something timed out" into "hop 2 consumed 90% of the budget" — which is a finding, not a mystery.

2 Split connection time from response time

$ curl -w "dns %{time_namelookup}  connect %{time_connect}  tls %{time_appconnect}  ttfb %{time_starttransfer}  total %{time_total}\n" \
       -o /dev/null -s http://inventory.svc:8080/v1/stock
dns 0.412  connect 0.413  tls 0.000  ttfb 2.318  total 2.319
  ← 412ms in DNS before a byte moved, then ~1.9s waiting on the server:
    two separate problems, both hiding under one "deadline exceeded"

Time in namelookup points at the resolver; walk it with dig. Time in connect points at the network path or a full accept queue on the server. Time in ttfb means the server got the request and is slow producing the answer. In Go, net/http/httptrace gives the same split in-process. The point of the split: raising the timeout can only help the ttfb case, where the server is doing real work that needs longer; for a slow resolver or a connection that never completes it only papers over the problem.

3 Check what the server saw

# client said:  deadline exceeded after 2s
# server access log for the same request id:
10.0.3.41 "GET /v1/stock" 200 31ms          ← server was fast and fine
#   …or:
10.0.3.41 "GET /v1/stock" 200 4920ms        ← server finished AFTER the client left

Three stories:

  • Server never logged it: the request never arrived. This is a connection-level problem; look at step 2’s connect phase.
  • Server answered fast: the budget was consumed before the request reached it (often DNS, dialing, or a queue in a proxy).
  • Server finished after the client gave up: the server burned real resources for an abandoned caller. Make sure the context is propagated into everything (QueryContext, not Query) so server work cancels when the client leaves.

4 gRPC: confirm the channel ever became ready

rpc error: code = DeadlineExceeded desc = context deadline exceeded
# same code whether the server was slow OR the connection never existed.
$ GRPC_GO_LOG_VERBOSITY_LEVEL=2 GRPC_GO_LOG_SEVERITY_LEVEL=info ./client 2>&1 | grep -i "channel\|resolver"
INFO: [core] Channel switches to new LB policy "pick_first"
INFO: [core] Subchannel Connectivity change to CONNECTING
INFO: [core] Subchannel Connectivity change to TRANSIENT_FAILURE   ← never connected

A gRPC DEADLINE_EXCEEDED with the channel stuck in CONNECTING/TRANSIENT_FAILURE means name resolution, TLS, or reachability — the server’s speed is unmeasured because no request ever left. Note that gRPC propagates your deadline to the server in the grpc-timeout header, so a well-behaved server stops working the moment the budget is gone.

The causes, ranked

  1. A downstream dependency is genuinely slow

     Confirm: traces or server-side timings show the time going into one call, such as a query, a cold cache, or a third-party API.

  2. Budget mismatch across hops

     Confirm: logging time.Until(deadline) at each hop shows requests arriving with single-digit milliseconds left.

  3. It never connected: DNS, dial, or TLS stall

     Confirm: curl -w / httptrace shows the time in namelookup or connect; gRPC channel logs show TRANSIENT_FAILURE.

  4. Server overload: the time is spent queueing, not working

     Confirm: server handler times look healthy but client-observed latency is far higher; listener accept-queue overflows climb (nstat TcpExtListenOverflows).

The fixes

A downstream dependency is genuinely slow

Fix that dependency (index the query, warm the cache) or deliberately re-budget: raise this hop’s share and lower another’s. Raising the top-level timeout without re-budgeting just moves where users wait.

Budget mismatch across hops

Practice deadline budgeting: the entrypoint sets the total, each hop gets an explicit fraction, and retries must fit inside the parent budget (n retries of a t-timeout call need roughly n×t plus backoff — if that exceeds the parent, the last retry was always doomed).

It never connected: DNS, dial, or TLS stall

Give dialing its own, shorter timeout (a 5s total budget should not allow a 5s dial), fix the resolver, and keep connections warm with pools/keepalives so the dial cost is paid rarely.

Server overload — the time is spent queueing, not working

Add concurrency limits and load shedding so the server fails fast instead of slowly; an early explicit rejection is cheaper for everyone than a timeout. Then add capacity if the shed rate says so.

What people get wrong

  • The client timing out doesn’t stop the server. Unless the context is threaded through every layer — HTTP request, database driver, downstream RPC — the server keeps computing for a client that already left. Under load this is a death spiral: timeouts breed retries, retries breed more abandoned work. Propagate ctx into everything that accepts one.
  • DeadlineExceeded and Canceled are different errors. context.DeadlineExceeded means the timer fired; context.Canceled means someone called cancel() — frequently because the caller’s own deadline fired upstream and cancellation cascaded down. Distinguish with errors.Is, never by string matching, and log which one you got: they point at different layers.
  • Retrying a timeout inside the same budget is theatre. A retry inherits the same parent context. If the first attempt consumed the budget, the retry times out instantly — visible in logs as pairs of failures milliseconds apart. Retries need their own sub-budgets, and ideally hedging policies decided per-route, not a blanket wrapper.

Quick answers

Does DEADLINE_EXCEEDED mean the server is slow?

Not necessarily. The same error covers a slow server, a connection that never got established (DNS, TLS, reachability), time consumed upstream before the request was sent, and queueing in front of the handler. Split connect time from response time (curl -w or httptrace) before concluding anything about the server.

How do I find which timeout actually fired?

The earliest deadline in the context chain wins, so inventory every WithTimeout/WithDeadline from the entrypoint down, including middleware and client library defaults. Logging time.Until(deadline) at each hop shows exactly where the budget went — requests arriving with a few milliseconds left name the culprit hop.

Does the server stop working when the client times out?

Only if cancellation propagates. gRPC sends the deadline to the server (grpc-timeout header) and cancels server-side contexts. A plain HTTP server learns that the client left only when the connection closes; Go's net/http cancels the request context at that point, but the handler still has to watch r.Context(). Anything not given the context, such as a Query instead of QueryContext, runs to completion for nobody.
