ELI5 · Distributed systems

Exponential backoff.

When a call fails, wait a little, then twice as long, then twice again — instead of hammering away.

When a request to another service fails, retrying is reasonable — the blip might be temporary. But retrying immediately and forever is a great way to make things worse: a struggling service gets pounded by a flood of instant retries and never gets room to recover.

Exponential backoff is the polite way to retry. Wait a short moment, try again; if it still fails, wait twice as long; then twice as long again. Each failure backs you off further, so a brief glitch is handled quickly while a real outage is met with patience instead of a stampede.

  1. You knock on a door and nobody answers: the call to another service just failed.

  2. So you wait a short moment (say 1s), then try again. A quick first retry handles a blip.

  3. Still failing? Each wait doubles: 1s, 2s, 4s, 8s, backing off further every time.

  4. This matters because everyone retrying instantly is a stampede: a retry storm that keeps the struggling service down.

  5. Adding jitter, a little randomness, scatters the retries so they do not arrive in sync.

  6. Finally, cap how long you wait and stop after a limit, so you never retry forever.

Knock again, but wait longer each time, with a random spread and a limit.
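
The six steps above can be sketched as a small retry loop. This is a minimal illustration, not a definitive implementation: the function name `retry_with_backoff`, its parameters and their defaults are all assumptions made for the example, and the jitter used here (wait a random fraction of the doubled delay) is just one common choice.

```python
import random
import time


def retry_with_backoff(call, base=1.0, cap=30.0, max_attempts=5,
                       sleep=time.sleep, rng=random.random):
    """Run `call()` until it succeeds or `max_attempts` is used up.

    Every name and default here is an illustrative assumption.
    `sleep` and `rng` are injectable so the loop can be tested
    without really waiting.
    """
    for attempt in range(max_attempts):
        try:
            return call()  # knock on the door
        except Exception:
            # Assumption: every exception is treated as retryable.
            # Real code would normally retry only transient errors.
            if attempt == max_attempts - 1:
                raise  # limit reached: give up and surface the error
            # The wait doubles each time (base, 2*base, 4*base, ...)
            # but never exceeds `cap`.
            ceiling = min(cap, base * 2 ** attempt)
            # Jitter: wait a random fraction of that ceiling.
            sleep(ceiling * rng())
```

A caller would wrap the flaky operation in a function and pass it in, for example `retry_with_backoff(fetch_profile)`, where `fetch_profile` is a hypothetical function that raises on failure.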

Why doubling the wait helps

If a dependency is briefly overloaded, the worst thing every client can do is retry instantly and in unison — that "retry storm" keeps the service flat on its back. Backing off exponentially means clients quickly thin out their attempts, giving the struggling service breathing space to drain its backlog and recover. A short glitch still gets a fast first retry, so you do not pay much for the common case.
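
To put a rough number on "thin out", the sketch below counts how many times a single client knocks during a one-minute outage, first retrying every second and then doubling its wait. The helper `attempts_within` and the 60-second window are assumptions chosen for illustration.

```python
def attempts_within(window, next_wait):
    """Count the attempts one client makes in `window` seconds.

    The first attempt happens at t=0. `next_wait(n)` returns the pause
    after the n-th failed attempt, with n starting at 0.
    """
    t, n = 0.0, 0
    while t <= window:
        n += 1                  # make an attempt at time t
        t += next_wait(n - 1)   # it fails, so wait before the next one
    return n


fixed = attempts_within(60, lambda n: 1.0)          # retry every second
doubling = attempts_within(60, lambda n: 2.0 ** n)  # wait 1s, 2s, 4s, 8s, ...
print(fixed, doubling)
```

Under these assumptions the fixed-interval client makes 61 attempts in the minute, while the doubling client makes 6 (at 0s, 1s, 3s, 7s, 15s and 31s). Multiply that difference by thousands of clients and the struggling service gets far more room to recover.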

Jitter and limits make it work

Pure doubling has a flaw: if many clients failed at the same instant, they all retry at the same instants too, producing synchronised waves of load. Adding jitter (a random spread to each wait) scatters the retries so they no longer arrive together. Just as important, you cap the maximum delay and stop retrying after a set number of attempts or a deadline, so a permanently down dependency does not leave callers retrying forever.
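
The synchronised-wave problem is easy to see in a small simulation. The sketch below is only an illustration: `retry_times`, its defaults and the fixed seed are assumptions, and picking a uniformly random wait between zero and the capped delay is one jitter scheme among several.

```python
import random


def retry_times(attempts, base=1.0, cap=8.0, jitter=False, rng=None):
    """Return when one client retries, in seconds after a shared failure.

    Names and defaults are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    t, times = 0.0, []
    for n in range(attempts):
        wait = min(cap, base * 2 ** n)   # doubling, capped at `cap`
        if jitter:
            wait = rng.uniform(0, wait)  # random spread up to that wait
        t += wait
        times.append(round(t, 2))
    return times


# Three clients that all failed at t=0.
plain = [retry_times(5) for _ in range(3)]
rng = random.Random(42)  # fixed seed so the run is repeatable
spread = [retry_times(5, jitter=True, rng=rng) for _ in range(3)]

print(plain)   # every client retries at the same five instants
print(spread)  # each client gets its own schedule
```

Without jitter, all three clients retry at exactly 1s, 3s, 7s, 15s and 23s. The cap shows up in the last step, where the wait stays at 8s instead of doubling to 16s. With jitter, each client draws its own waits, so the retries no longer land together.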
