Replication
Every replication system is the same machine: a leader produces an ordered log of changes, and followers consume it. The differences that matter are what goes in the log (statements, rows, or raw WAL bytes), when the leader waits for followers (sync or async), and how many leaders there are. Get those three choices right for your workload and replication is boring. Get them wrong and you lose committed writes during a failover, on a Tuesday, in front of everyone.
The setup
You replicate for four reasons, and they pull in different directions. Durability: a second copy survives the first machine dying. Read throughput: followers absorb read traffic the leader doesn't have to serve. Latency: a replica near the user answers reads without a cross-region round trip. Disaster recovery: a copy in another region survives the whole datacentre going away. A read replica in the same rack helps with throughput and nothing else; a cross-region async replica helps with DR and makes the consistency story harder. Most production fleets want some of each.
Mechanically, replication is almost always "ship the log". The write-ahead log the database already produces for crash recovery is an ordered, complete record of every change, so the cheapest replication design is to stream it to another machine and replay it. That's why this page sits right after the WAL deep dive: replication is the WAL, pointed at the network. The plain-English version of this whole page lives at ELI5: replication; the distributed-systems angle (quorums, leaderless designs, consistency guarantees) is in the distributed-systems topic. This page is about what the database engines actually do.
What goes in the log: statement, row, logical
The first design decision is the unit of replication.
- Statement-based. Ship the SQL text —
UPDATE accounts SET balance = balance - 100 WHERE id = 7— and re-execute it on the follower. Compact and human-readable, but any non-determinism diverges the copies:NOW(),RAND(), auto-increment values, anUPDATE ... LIMITwithout anORDER BY, a trigger that behaves differently. MySQL shipped this as its original binlog format and spent a decade patching the non-determinism cases (it ships the timestamp alongside the statement, for instance) before making row-based the default in 5.7. - Row-based (physical-logical). Ship the effect, not the cause: "row
with primary key 7 changed from these column values to these". Deterministic by
construction, and a single statement that touches a million rows produces a million
row events — which is the trade-off. This is MySQL's
binlog_format = ROWand what MongoDB's oplog stores. - Physical (byte-level). Ship the raw WAL records and apply them block-for-block to identical on-disk pages. The follower is a byte-for-byte clone: same engine version, same architecture, same indexes, same bloat. Replay is cheap because it skips the executor entirely. This is Postgres streaming replication.
- Logical. Decode the physical WAL back into row-level changes and ship those. You get row-based semantics from a physical log: the consumer can be a different Postgres version, a subset of tables, or not a database at all (Debezium into Kafka). Postgres logical replication and the whole CDC ecosystem live here.
Sync, async, and the middle
The second decision: does a commit wait for followers?
- Asynchronous. The leader acks the client as soon as its own WAL fsync returns, and ships log records best-effort. Write latency is identical to a standalone database. The cost: if the leader dies, whatever the followers hadn't received yet is gone. Promote a lagging follower and those committed transactions never happened.
- Synchronous. The commit doesn't return until at least one follower has acknowledged the log record. Now a leader crash loses nothing — there is always another copy of every acked commit. The cost is a network round trip on every commit path, and a harder question: what happens when the synchronous follower is down? If the answer is "writes block", a replica failure becomes a leader outage. If it's "fall back to async", you've built async with extra steps.
- Semi-sync / quorum. The practical middle. Wait for any K of N
followers rather than one specific machine, so a single replica failure doesn't
stall commits. Postgres expresses this directly:
synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'. MySQL's semi-synchronous plugin waits for a follower to receive (not apply) the binlog event. Postgres lets you pick the wait point per session withsynchronous_commit = remote_write | on | remote_apply— received, flushed, or actually applied and visible to reads on the standby.
A common production shape: one synchronous standby in a nearby zone (durable ack at low RTT), plus async replicas across regions for reads and DR. Durability where it's cheap, availability where sync would hurt.
One leader or several
The third decision: who accepts writes?
- Leader-follower. One node takes all writes; everyone else replays the log. Conflicts are impossible by construction, because there's one serial order of writes — the leader's. This is the default everywhere: Postgres, MySQL, MongoDB replica sets, SQL Server availability groups. The price is that write throughput is capped by one machine (the cure for that is sharding, next page) and that losing the leader is an event you have to handle.
- Multi-leader. Several nodes accept writes and exchange changes, typically one leader per region. Writes are fast everywhere, and a region can keep accepting writes through a partition. The price is conflicts: two leaders accept incompatible writes to the same row, and now the system needs a resolution rule — last-writer-wins (which silently drops one write), per-column merge, application callbacks, or CRDTs. MySQL circular replication, Postgres BDR, and most "active-active" setups live here, and the conflict-handling story is the part to interrogate hardest.
- Leaderless. No distinguished node: clients write to several replicas and read with overlapping quorums. Dynamo, Cassandra, Riak. A different family with different failure modes — how Cassandra replicates covers it; this page stays with the log-shipping family.
How the engines actually do it
Postgres: streaming replication. A standby connects over the
replication protocol; a walsender process on the primary streams raw WAL
records as they're produced, and the standby's walreceiver writes and
applies them in continuous recovery mode. The standby can serve reads (hot
standby) while it replays. Two pieces of plumbing matter operationally:
replication slots, which make the primary retain WAL until a given
standby has consumed it (a disconnected standby with a slot can fill the primary's
disk — the safety feature is also the footgun, which is what
max_slot_wal_keep_size bounds), and recovery conflicts —
replay can need to remove row versions a long-running standby query still depends on, so
you choose between delaying replay, cancelling the query, or letting
hot_standby_feedback hold back
vacuum on the primary.
Logical replication (publications and subscriptions, since Postgres 10) runs through the
same WAL via logical decoding.
MySQL: the binlog. MySQL's replication log is not its crash-recovery log. InnoDB keeps its own redo log for recovery; the binary log is a separate, server-level log written at commit, in statement, row, or mixed format, and that's what replicas consume. Keeping the two logs consistent costs an internal two-phase commit between InnoDB and the binlog on every transaction. Replicas pull binlog events with an I/O thread, spool them to a local relay log, and apply them with SQL threads — historically single-threaded, which made apply speed the classic lag bottleneck; parallel appliers improved this over the 5.7/8.0 era. GTIDs (global transaction IDs) replaced file-and-offset coordinates, which is what makes failover and replica repointing tractable.
MongoDB: the oplog. A replica set's leader writes every change as a
document in a capped collection, local.oplog.rs, and followers tail it like
a cursor. Two design choices do a lot of work: oplog entries are idempotent
(an $inc is stored as the resulting value, not the increment), so a
follower can re-apply a window of the oplog after a restart without arithmetic drift;
and the oplog is a fixed-size ring, so a follower that falls behind by more than the
window can't catch up incrementally and must full-resync. Failover is built in: the
replica set runs an election protocol (Raft-derived since v1 of the protocol was
retired) and writeConcern: majority gives you the sync-replication
guarantee per write.
Aurora: the log is the database. Amazon Aurora takes the "WAL is the truth, pages are cache" idea to its conclusion. The MySQL- or Postgres-compatible compute node never writes data pages at all — it ships only redo log records to a purpose-built storage service, which keeps six copies across three availability zones and acknowledges writes at a 4-of-6 quorum. The storage layer materialises pages from the log itself, continuously and on demand, so there are no checkpoints, no full-page writes, and crash recovery doesn't replay a log at startup — the log already lives in the storage tier. Replicas are cheap because they share that storage and only need the log stream to invalidate their cached pages, not to rebuild a full copy. The Verbitski et al. SIGMOD 2017 paper is the reference, and its argument — replicate the log, not the pages, because the log is smaller and the network is the bottleneck — is the cleanest statement of what this whole page is about.
| Engine | What ships | Sync options | Failover story |
|---|---|---|---|
| Postgres | Raw WAL (physical) or decoded rows (logical) | synchronous_commit + synchronous_standby_names (ANY/FIRST k) | External tooling: Patroni, repmgr, cloud managers |
| MySQL | Binlog events (row, statement, or mixed) | Async by default; semi-sync plugin waits for receipt | External: orchestrator, group replication, ProxySQL |
| MongoDB | Oplog documents (idempotent row-level ops) | Per-write writeConcern (1, majority, n) | Built-in elections inside the replica set |
| Aurora | Redo log records only, to shared storage | 4-of-6 storage quorum per write | Managed promotion onto shared storage volume |
Replication lag
Async followers are always a little behind, and "a little" is unbounded. Lag comes from the apply side more often than the network: a single apply thread replaying a multi-threaded leader's write volume, one long transaction on the leader producing a burst the follower chews through serially, replay stalling behind a recovery conflict, or an under-provisioned replica doing the same writes on slower disks.
The damage is application-visible anomalies, not just a metric. A user writes to the
leader, reads from a follower, and their own write is missing — no
read-your-writes. Two successive reads hit differently-lagged replicas and time
appears to move backwards — no monotonic reads. The taxonomy of these
guarantees is the
consistency models
topic; the database-side fixes are routing (pin a session to the leader after it
writes, or until the replica catches up to the session's last-write LSN) and the
stronger sync modes above (remote_apply makes a chosen standby
read-your-writes by construction, at commit-latency cost).
pg_stat_replication gives you the gap between the primary's current WAL
LSN and each standby's sent/flushed/replayed LSN — an exact byte distance. MySQL's
famous Seconds_Behind_Source is derived from event timestamps and reads 0
when the I/O thread is the thing that's behind, which is exactly when you most need the
truth. Alerting on a timestamp heartbeat table or GTID distance is the standard
workaround.Failover
Failover is where every choice above gets graded. The sequence is: detect that the leader is gone, pick a new one, repoint everything. Each step has a classic failure mode.
- Detection. "The leader is down" and "we can't reach the leader" are indistinguishable from the outside — failure detection is timeouts and judgement. Too aggressive and you fail over during a GC pause or a network blip; too conservative and you're down for minutes.
- Promotion. With async replication, the most caught-up follower is still possibly behind, and promoting it silently discards the gap — committed transactions the old leader acked and nobody else had. Sync replication exists precisely to make this gap empty. The choice of which follower, made by more than one party at once, is a leader election problem.
- Fencing. The worst outcome is the old leader not being dead — it comes back from its network partition still believing it's the leader, accepts writes, and now two divergent histories exist (split brain). Real failover systems fence the old leader: revoke its storage access, kill it via management plane (STONITH — shoot the other node in the head), or use leases so leadership simply expires.
Because getting all three right requires consensus, the credible Postgres HA tooling outsources the decision: Patroni stores leadership state in etcd or Consul (a Raft-backed store) so there can only be one leader on record, and the Postgres processes follow that record. MongoDB builds the same machinery into the replica set itself. The pattern is identical either way: replication ships the data, but failover safety comes from a consensus layer deciding who the leader is.
Further reading
- Postgres docs — Log-Shipping Standby Servers Streaming replication, replication slots, synchronous commit modes, hot standby.
- Postgres docs — Logical Replication Publications, subscriptions, and what logical replication does and doesn't carry.
- MySQL Reference Manual — Replication Binlog formats, GTIDs, semi-synchronous replication, multi-threaded appliers.
- MongoDB docs — Replication Replica sets, the oplog, elections, read and write concerns.
- Verbitski et al. — Amazon Aurora (SIGMOD 2017) The "log is the database" paper: 6-way replicated storage, 4-of-6 write quorums, no checkpoints.
- Kleppmann — DDIA, chapter 5 The best single chapter on replication topologies, lag anomalies, and conflict resolution.
- Semicolony — Replication (distributed systems) The same subject from the distributed-systems side: quorums, leaderless designs, anti-entropy.