13 / 20
Topics / 13

Anti-entropy & read repair

A leaderless, quorum-based store accepts writes even when some replicas are down or unreachable. That keeps it available, but it also means replicas drift apart: one has the latest value, another has a stale one, a third missed the write entirely. Anti-entropy is the family of background and on-path mechanisms that drag those replicas back into agreement — read repair while serving a request, hinted handoff while a node is down, and Merkle-tree comparison to find divergence cheaply across enormous key ranges.


Why replicas diverge

In a Dynamo-style store, a write is accepted once W of N replicas ack it. If W < N, the remaining replicas are behind the moment the write returns. Add node failures, dropped messages, and network partitions, and divergence is the normal state, not the exception. The quorum overlap W + R > N guarantees a read sees a fresh copy; it does nothing to fix the stale ones. Anti-entropy is what eventually makes them all agree.

Read repair — heal on the read path

When a coordinator reads from R replicas and notices they disagree, it picks the winning version (highest timestamp, or by merging version vectors) and writes it back to the stale replicas before returning to the client. The data heals as a side effect of being read.

It is cheap and opportunistic, but it only repairs keys that are actually read. There are two flavours:

  • Foreground (blocking) read repair. The coordinator waits for the repair write before responding. Stronger guarantee, higher read latency.
  • Background read repair. The coordinator returns immediately and repairs asynchronously. Cassandra's probabilistic read_repair_chance took this route. Faster reads, weaker convergence.

Read repair leaves a gap: cold data that is rarely read can stay stale indefinitely. That is what the other two mechanisms exist to cover.

Hinted handoff — buffer while a node is down

If a target replica is unreachable when a write arrives, the coordinator can store a hint — a note saying "replica C owes this write" — on another node, and ack the write immediately (a sloppy quorum). When C comes back, the holder replays the hinted writes to it, and the hint is dropped.

This keeps writes flowing through a transient outage without losing them. The caveats are real: hints accumulate while a node is down for a long time, and they have a TTL (Cassandra's max_hint_window_in_ms, default 3 hours). Past the window, hints are discarded and the down node must be repaired by full anti-entropy instead. Hinted handoff is a stopgap, not a durability guarantee.

Sloppy quorums and hinted handoff go together. A sloppy quorum lets the W acks come from any W reachable nodes, not just the designated replicas. Hinted handoff is the bookkeeping that makes sure the designated replicas eventually catch up. Without it, a sloppy quorum quietly loses writes.

Merkle trees — find divergence cheaply

To reconcile two replicas holding millions of keys, you cannot ship every key and compare. Merkle trees solve this. Each replica builds a hash tree over its key range: leaves hash individual keys (or buckets of keys), and every internal node hashes its children. The root summarises the entire range in one hash.

Two replicas compare roots. If the roots match, the ranges are identical — nothing to do. If they differ, the replicas walk down only the subtrees whose hashes disagree, halving the search at each level. A single divergent key is located in O(log n) hash comparisons, and only the genuinely different data is transferred.

Dynamo introduced this for inter-replica repair; Cassandra exposes it as nodetool repair, Riak as active anti-entropy. The same structure backs Git's object model and blockchain block verification, for the same reason: efficient comparison of large datasets by exchanging tiny hashes.

The three mechanisms together

MechanismWhen it runsCovers
Read repairOn the read path, per requestHot keys that are actually read
Hinted handoffDuring a transient node outageWrites that missed a temporarily-down replica
Merkle anti-entropyScheduled / on-demand background sweepCold data, long outages, bit rot — the rest

Read repair handles the data people touch, hinted handoff handles short outages, and Merkle-tree anti-entropy is the periodic backstop that guarantees full convergence regardless of access patterns. Together they turn "eventually consistent" from a hope into a property the system actively enforces.

Common misunderstandings

  • "Read repair makes the store consistent." It makes the keys you read converge. Unread keys can drift until anti-entropy runs. It is a convergence aid, not a consistency model.
  • "Hinted handoff means writes are never lost." Hints expire. Beyond the hint window, only a full repair recovers the down replica's missing data — which is why you schedule regular anti-entropy.
  • "Anti-entropy is free." Merkle builds and repair streaming consume CPU, disk, and network. Running nodetool repair on a busy cluster is a real operational event that needs throttling.

Further reading

Found this useful?