ELI5 · Distributed systems

Distributed lock.

A single key, held across machines, that lets only one of them touch a shared thing at a time.

On one machine, stopping two threads from clobbering the same data is easy: a lock lets only one in at a time. Across many machines there is no shared memory to hold that lock in, yet you still sometimes need to ensure only one of them does a thing — run a scheduled job once, charge a card once, be the single leader.

A distributed lock is a single key kept somewhere all the machines can reach. Whoever grabs it gets exclusive permission to act; everyone else waits or backs off. It is harder than a local lock, because now machines and the network between them can fail mid-grip.

  1. shared three machines, one thing
    1

    Many machines, no shared memory — yet only one should touch the shared thing.

  2. lock store
    2

    So you keep a single key in a store every machine can reach.

  3. Got it — mine.
    waiting… waiting…
    3

    Whoever grabs the key gets exclusive permission; the rest wait or back off.

  4. a lease, not forever — TTL
    4

    The key is a lease, not forever: it auto-expires unless the holder keeps renewing.

  5. …wait, still mine?
    stalled expired — handed on
    5

    The danger: a holder stalls past expiry, so the lease is handed on while it still thinks it holds it.

  6. resource seen: 24 #23 stale token refused
    6

    A fencing token fixes it: each grant gets a higher number, and the resource refuses any stale one.

One key, held across machines, so only one of them touches the shared thing at a time.

Why a lease, not a forever-lock

If a node grabs a plain lock and then crashes, the lock would be held forever and nobody else could ever proceed. So distributed locks are usually leases with a time-to-live: the lock automatically expires unless the holder keeps renewing it. That guarantees progress even when a holder dies, but it introduces the central danger: the holder might be paused (a long GC pause, a network stall) past the expiry, so the lease is handed to someone else while the first node still believes it holds it.

Fencing tokens close the gap

The fix for that danger is a fencing token: each time the lock is granted, the store hands out a number that only ever increases. The holder includes that token whenever it acts on the protected resource, and the resource rejects any token older than the highest it has seen. So even if a stalled old holder wakes up and tries to write, its stale token is refused. This is why correct distributed locking depends as much on the resource checking tokens as on the lock service itself, and why naive single-node locks can be unsafe under failure.

The real version Distributed lock simulator →
Found this useful?