07 / 14
Internals / 07

Replication

Every replication system is the same machine: a leader produces an ordered log of changes, and followers consume it. The differences that matter are what goes in the log (statements, rows, or raw WAL bytes), when the leader waits for followers (sync or async), and how many leaders there are. Get those three choices right for your workload and replication is boring. Get them wrong and you lose committed writes during a failover, on a Tuesday, in front of everyone.


The setup

You replicate for four reasons, and they pull in different directions. Durability: a second copy survives the first machine dying. Read throughput: followers absorb read traffic the leader doesn't have to serve. Latency: a replica near the user answers reads without a cross-region round trip. Disaster recovery: a copy in another region survives the whole datacentre going away. A read replica in the same rack helps with throughput and nothing else; a cross-region async replica helps with DR and makes the consistency story harder. Most production fleets want some of each.

Mechanically, replication is almost always "ship the log". The write-ahead log the database already produces for crash recovery is an ordered, complete record of every change, so the cheapest replication design is to stream it to another machine and replay it. That's why this page sits right after the WAL deep dive: replication is the WAL, pointed at the network. The plain-English version of this whole page lives at ELI5: replication; the distributed-systems angle (quorums, leaderless designs, consistency guarantees) is in the distributed-systems topic. This page is about what the database engines actually do.

What goes in the log: statement, row, logical

The first design decision is the unit of replication.

  • Statement-based. Ship the SQL text — UPDATE accounts SET balance = balance - 100 WHERE id = 7 — and re-execute it on the follower. Compact and human-readable, but any non-determinism diverges the copies: NOW(), RAND(), auto-increment values, an UPDATE ... LIMIT without an ORDER BY, a trigger that behaves differently. MySQL shipped this as its original binlog format and spent a decade patching the non-determinism cases (it ships the timestamp alongside the statement, for instance) before making row-based the default in 5.7.
  • Row-based (physical-logical). Ship the effect, not the cause: "row with primary key 7 changed from these column values to these". Deterministic by construction, and a single statement that touches a million rows produces a million row events — which is the trade-off. This is MySQL's binlog_format = ROW and what MongoDB's oplog stores.
  • Physical (byte-level). Ship the raw WAL records and apply them block-for-block to identical on-disk pages. The follower is a byte-for-byte clone: same engine version, same architecture, same indexes, same bloat. Replay is cheap because it skips the executor entirely. This is Postgres streaming replication.
  • Logical. Decode the physical WAL back into row-level changes and ship those. You get row-based semantics from a physical log: the consumer can be a different Postgres version, a subset of tables, or not a database at all (Debezium into Kafka). Postgres logical replication and the whole CDC ecosystem live here.
Physical vs logical is a real fork. Physical replication can't skip a table, can't cross major versions, and replicates the bloat along with the data — but it replicates everything (DDL, sequences, all of it) with no per-table setup. Logical replication is selective and version-flexible, but in Postgres it historically didn't replicate DDL or sequence state, so schema migrations need their own choreography. Teams usually run physical replication for HA and logical replication for CDC and migrations, side by side.

Sync, async, and the middle

The second decision: does a commit wait for followers?

  • Asynchronous. The leader acks the client as soon as its own WAL fsync returns, and ships log records best-effort. Write latency is identical to a standalone database. The cost: if the leader dies, whatever the followers hadn't received yet is gone. Promote a lagging follower and those committed transactions never happened.
  • Synchronous. The commit doesn't return until at least one follower has acknowledged the log record. Now a leader crash loses nothing — there is always another copy of every acked commit. The cost is a network round trip on every commit path, and a harder question: what happens when the synchronous follower is down? If the answer is "writes block", a replica failure becomes a leader outage. If it's "fall back to async", you've built async with extra steps.
  • Semi-sync / quorum. The practical middle. Wait for any K of N followers rather than one specific machine, so a single replica failure doesn't stall commits. Postgres expresses this directly: synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'. MySQL's semi-synchronous plugin waits for a follower to receive (not apply) the binlog event. Postgres lets you pick the wait point per session with synchronous_commit = remote_write | on | remote_apply — received, flushed, or actually applied and visible to reads on the standby.

A common production shape: one synchronous standby in a nearby zone (durable ack at low RTT), plus async replicas across regions for reads and DR. Durability where it's cheap, availability where sync would hurt.

One leader or several

The third decision: who accepts writes?

  • Leader-follower. One node takes all writes; everyone else replays the log. Conflicts are impossible by construction, because there's one serial order of writes — the leader's. This is the default everywhere: Postgres, MySQL, MongoDB replica sets, SQL Server availability groups. The price is that write throughput is capped by one machine (the cure for that is sharding, next page) and that losing the leader is an event you have to handle.
  • Multi-leader. Several nodes accept writes and exchange changes, typically one leader per region. Writes are fast everywhere, and a region can keep accepting writes through a partition. The price is conflicts: two leaders accept incompatible writes to the same row, and now the system needs a resolution rule — last-writer-wins (which silently drops one write), per-column merge, application callbacks, or CRDTs. MySQL circular replication, Postgres BDR, and most "active-active" setups live here, and the conflict-handling story is the part to interrogate hardest.
  • Leaderless. No distinguished node: clients write to several replicas and read with overlapping quorums. Dynamo, Cassandra, Riak. A different family with different failure modes — how Cassandra replicates covers it; this page stays with the log-shipping family.

How the engines actually do it

Postgres: streaming replication. A standby connects over the replication protocol; a walsender process on the primary streams raw WAL records as they're produced, and the standby's walreceiver writes and applies them in continuous recovery mode. The standby can serve reads (hot standby) while it replays. Two pieces of plumbing matter operationally: replication slots, which make the primary retain WAL until a given standby has consumed it (a disconnected standby with a slot can fill the primary's disk — the safety feature is also the footgun, which is what max_slot_wal_keep_size bounds), and recovery conflicts — replay can need to remove row versions a long-running standby query still depends on, so you choose between delaying replay, cancelling the query, or letting hot_standby_feedback hold back vacuum on the primary. Logical replication (publications and subscriptions, since Postgres 10) runs through the same WAL via logical decoding.

MySQL: the binlog. MySQL's replication log is not its crash-recovery log. InnoDB keeps its own redo log for recovery; the binary log is a separate, server-level log written at commit, in statement, row, or mixed format, and that's what replicas consume. Keeping the two logs consistent costs an internal two-phase commit between InnoDB and the binlog on every transaction. Replicas pull binlog events with an I/O thread, spool them to a local relay log, and apply them with SQL threads — historically single-threaded, which made apply speed the classic lag bottleneck; parallel appliers improved this over the 5.7/8.0 era. GTIDs (global transaction IDs) replaced file-and-offset coordinates, which is what makes failover and replica repointing tractable.

MongoDB: the oplog. A replica set's leader writes every change as a document in a capped collection, local.oplog.rs, and followers tail it like a cursor. Two design choices do a lot of work: oplog entries are idempotent (an $inc is stored as the resulting value, not the increment), so a follower can re-apply a window of the oplog after a restart without arithmetic drift; and the oplog is a fixed-size ring, so a follower that falls behind by more than the window can't catch up incrementally and must full-resync. Failover is built in: the replica set runs an election protocol (Raft-derived since v1 of the protocol was retired) and writeConcern: majority gives you the sync-replication guarantee per write.

Aurora: the log is the database. Amazon Aurora takes the "WAL is the truth, pages are cache" idea to its conclusion. The MySQL- or Postgres-compatible compute node never writes data pages at all — it ships only redo log records to a purpose-built storage service, which keeps six copies across three availability zones and acknowledges writes at a 4-of-6 quorum. The storage layer materialises pages from the log itself, continuously and on demand, so there are no checkpoints, no full-page writes, and crash recovery doesn't replay a log at startup — the log already lives in the storage tier. Replicas are cheap because they share that storage and only need the log stream to invalidate their cached pages, not to rebuild a full copy. The Verbitski et al. SIGMOD 2017 paper is the reference, and its argument — replicate the log, not the pages, because the log is smaller and the network is the bottleneck — is the cleanest statement of what this whole page is about.

EngineWhat shipsSync optionsFailover story
Postgres Raw WAL (physical) or decoded rows (logical) synchronous_commit + synchronous_standby_names (ANY/FIRST k) External tooling: Patroni, repmgr, cloud managers
MySQL Binlog events (row, statement, or mixed) Async by default; semi-sync plugin waits for receipt External: orchestrator, group replication, ProxySQL
MongoDB Oplog documents (idempotent row-level ops) Per-write writeConcern (1, majority, n) Built-in elections inside the replica set
Aurora Redo log records only, to shared storage 4-of-6 storage quorum per write Managed promotion onto shared storage volume

Replication lag

Async followers are always a little behind, and "a little" is unbounded. Lag comes from the apply side more often than the network: a single apply thread replaying a multi-threaded leader's write volume, one long transaction on the leader producing a burst the follower chews through serially, replay stalling behind a recovery conflict, or an under-provisioned replica doing the same writes on slower disks.

The damage is application-visible anomalies, not just a metric. A user writes to the leader, reads from a follower, and their own write is missing — no read-your-writes. Two successive reads hit differently-lagged replicas and time appears to move backwards — no monotonic reads. The taxonomy of these guarantees is the consistency models topic; the database-side fixes are routing (pin a session to the leader after it writes, or until the replica catches up to the session's last-write LSN) and the stronger sync modes above (remote_apply makes a chosen standby read-your-writes by construction, at commit-latency cost).

Measure lag in log position, not seconds. Postgres' pg_stat_replication gives you the gap between the primary's current WAL LSN and each standby's sent/flushed/replayed LSN — an exact byte distance. MySQL's famous Seconds_Behind_Source is derived from event timestamps and reads 0 when the I/O thread is the thing that's behind, which is exactly when you most need the truth. Alerting on a timestamp heartbeat table or GTID distance is the standard workaround.

Failover

Failover is where every choice above gets graded. The sequence is: detect that the leader is gone, pick a new one, repoint everything. Each step has a classic failure mode.

  • Detection. "The leader is down" and "we can't reach the leader" are indistinguishable from the outside — failure detection is timeouts and judgement. Too aggressive and you fail over during a GC pause or a network blip; too conservative and you're down for minutes.
  • Promotion. With async replication, the most caught-up follower is still possibly behind, and promoting it silently discards the gap — committed transactions the old leader acked and nobody else had. Sync replication exists precisely to make this gap empty. The choice of which follower, made by more than one party at once, is a leader election problem.
  • Fencing. The worst outcome is the old leader not being dead — it comes back from its network partition still believing it's the leader, accepts writes, and now two divergent histories exist (split brain). Real failover systems fence the old leader: revoke its storage access, kill it via management plane (STONITH — shoot the other node in the head), or use leases so leadership simply expires.

Because getting all three right requires consensus, the credible Postgres HA tooling outsources the decision: Patroni stores leadership state in etcd or Consul (a Raft-backed store) so there can only be one leader on record, and the Postgres processes follow that record. MongoDB builds the same machinery into the replica set itself. The pattern is identical either way: replication ships the data, but failover safety comes from a consensus layer deciding who the leader is.

Further reading

Found this useful?