ELI5 · Distributed systems

Consensus (Raft).

A group agreeing on one version of the story, even when some members go quiet.

Several machines are keeping copies of the same data for safety. But how do they agree on the order of changes, so the copies never disagree, even if a machine crashes mid-update?

They run a consensus algorithm. Raft is the popular, readable one, and it works by electing a single leader everyone follows — like a committee that picks a chairperson to avoid everyone talking over each other.

  1. No, MY version is right!
    mine! no, mine! mine!
    1

    Several machines keep copies of the same data. Left to talk over each other, the copies would drift apart.

  2. You take the chair.
    2

    So they vote one of themselves in as leader — the chair everyone now follows.

  3. Proposing: set x = 5.
    leader
    3

    Every change goes through the leader, who writes it down in order and proposes it to the rest.

  4. Three of five agree — committed.
    3 of 5 — committed
    4

    It’s only locked in once more than half the machines have stored it. A majority makes it official.

  5. they always share one
    5

    Why majority? Any two majorities share at least one machine, and it refuses to back two conflicting stories.

  6. Chair’s gone quiet — new vote.
    quiet new chair, no data lost
    6

    If the leader goes quiet, the survivors notice and elect a new one. A brief pause, but no data lost.

A committee picks one chair, who proposes; nothing is official until a majority agrees.

Why "majority" is the magic word

Nothing becomes official until a majority of machines agree. This single rule prevents the nightmare of two machines both believing they are in charge and writing conflicting history.

Two different majorities of the same group must always overlap by at least one machine, and that shared machine refuses to agree to two conflicting things. So the group can never split into two valid truths.

Surviving failures gracefully

As long as more than half the machines are alive and able to talk, the group keeps making progress. A five-machine cluster can lose two and carry on.

If the leader crashes, the rest wait a moment, notice it has gone quiet, and hold a new election. There may be a brief pause, but no data is lost and no contradiction creeps in. This is how systems like etcd (which Kubernetes relies on) stay trustworthy.

The real version How Raft consensus works →
Found this useful?