Sharding
Replication puts the same data on more machines; sharding puts different data on more machines. It's how a database outgrows one box's write throughput and one box's disk — and it's a one-way door. Once the rows are split by a key, every query either carries that key and goes to one shard, or doesn't and goes to all of them. Most of sharding practice is living with that sentence.
The setup
A single Postgres or MySQL primary takes you a long way — further than most teams expect, especially with read replicas absorbing the read traffic. But replication doesn't scale writes: every replica replays every write, so the write ceiling is still one machine's WAL, one machine's buffer pool, one machine's disks. When the working set outgrows RAM, or write volume outgrows one box, or a single table's indexes get too deep to maintain, the remaining move is to split the data itself. Each shard is an ordinary database holding a slice of the rows, usually with its own replica set behind it for HA — sharding and replication compose, they don't compete.
Two framings of the same idea, two pages on this site: the distributed-systems topic treats partitioning abstractly (it's the same mechanism inside Kafka, DynamoDB, and every distributed cache), and ELI5: sharding is the no-jargon version. This page is the database-operator's view: the strategies, the systems that implement them, and the three problems that take up all the time.
Everything downstream is determined by one decision: the shard key. The
key decides which queries are single-shard (fast, scalable) and which are scatter-gather
(touch every shard, scale with the slowest one). It decides whether load spreads evenly.
And it's nearly immutable — changing it later means rewriting every row's placement.
A good key has high cardinality, spreads writes evenly, and appears in the
WHERE clause of the queries you care about. user_id or
tenant_id usually qualifies; created_at usually doesn't.
Four ways to split a keyspace
- Range. Each shard owns a contiguous interval of the key:
[a, f),[f, m), and so on. Range scans stay on one shard and shards can split exactly where the data is dense — Bigtable's tablets, HBase regions, and the per-range Raft groups in CockroachDB and friends all work this way. The failure mode is built in: a monotonically increasing key (timestamps, sequential IDs) sends every insert to the last range, so one shard does all the writing while the rest watch. - Hash. Shard by
hash(key). Adjacent keys scatter, so load spreads evenly and the monotonic-insert problem disappears — and so do range scans, which now touch every shard. The naive form,hash(key) mod N, has a trap: changing N remaps almost every key. Consistent hashing fixes that by making each node's arrival or departure move only its own slice (Cassandra's token ring with vnodes), and the pre-split form — fix a large number of hash buckets up front, assign buckets to shards — gets the same effect with simpler bookkeeping. - Directory. Don't compute the placement, store it: a lookup table maps each key (or each tenant) to its shard. Maximum flexibility — you can move one hot customer to their own hardware, place a regulated tenant in a specific region, rebalance one row at a time. The cost is that the directory is now on the path of every query: it has to be cached aggressively, replicated, and treated as tier-zero infrastructure. This is the standard shape for B2B SaaS sharded by tenant.
- Geo. Partition by where the data legally or physically belongs: EU rows on EU shards. Sometimes this is latency engineering, increasingly it's compliance (data residency), and it composes with the other schemes — geo picks the region, hash or range places the row within it. Spanner's placement policies and CockroachDB's regional tables are this idea built into the database.
Three systems worth knowing
Vitess: sharded MySQL with a routing layer. Built at YouTube to scale
MySQL, now a CNCF project and the engine under PlanetScale. The application talks to
vtgate, a stateless proxy that speaks the MySQL protocol; behind it, each
shard is a normal MySQL replica set managed by a vttablet sidecar.
Routing is driven by vindexes — declared functions from column values
to keyspace IDs, so the query layer can tell which shards a query touches and prune the
rest. The headline feature is resharding via VReplication: new shards
are populated by streaming the binlog from the old ones, traffic is cut over when
they've caught up, and the switch is reversible until you commit to it. Slack, GitHub,
and Square have all written publicly about running it.
Citus: sharding inside Postgres. A Postgres extension (now owned by
Microsoft, open source) rather than a proxy. You pick a distribution column and call
create_distributed_table(); the coordinator keeps metadata and the rows
live in many shards spread across worker nodes. The planner intercepts queries: a query
filtered on the distribution column routes to one worker, an analytical query fans out
and aggregates partial results. Two ideas do most of the work in practice:
reference tables (small lookup tables copied in full to every worker,
so joins against them stay local) and co-location (tables distributed
by the same column have matching rows on the same worker, so tenant-scoped joins never
cross the network). The natural fit is multi-tenant SaaS sharded by
tenant_id.
MongoDB: chunks and a balancer. The built-in version of the same
architecture: mongos routers in front, config servers holding the
metadata, and each shard a replica set. A sharded collection is divided into
chunks — contiguous shard-key ranges — and a background balancer
migrates chunks between shards when the distribution gets uneven. The shard key
supports hashed or ranged flavours, and its historical rigidity has softened: recent
versions can refine a key (add suffix fields) and reshard a collection in place. The
operational lesson MongoDB's docs spend the most ink on is the same one as everywhere
else: a bad shard key, not a bad balancer, is what sinks deployments.
| System | Unit of placement | Routing | Resharding |
|---|---|---|---|
| Vitess | Keyspace-ID ranges over MySQL shards | vtgate proxy + vindexes | VReplication: binlog copy, verified cutover |
| Citus | Hash buckets of the distribution column | Postgres planner on the coordinator | Online shard rebalancer moves shards between workers |
| MongoDB | Chunks (shard-key ranges) | mongos + config servers | Automatic balancer; in-place resharding in recent versions |
| App-level (directory) | Whatever the lookup table says | Your code + a cached directory service | Yours to build — usually dual-write + backfill + cutover |
Hard problem one: the hot shard
Hashing spreads keys evenly, not load. If one key is orders of magnitude hotter than the rest — the celebrity account, the viral post, the one enterprise tenant who is half your traffic — its shard runs hot no matter how the others are balanced, and adding shards doesn't help because the hot key still lives on exactly one of them.
- Pick a key that can't concentrate. The cheapest fix is upstream:
shard messages by
(channel_id)and one giant channel melts a shard; shard by(channel_id, bucket)and it can't. - Salt the hot keys. Append a small random suffix to the hottest keys so their writes spread over several shards, at the price of fanning out reads for those keys to reassemble them. Usually applied only to detected-hot keys, not globally.
- Cache in front. A read-hot key is often better absorbed by a cache tier than by resharding the database under it.
- Isolate. With a directory scheme, move the hot tenant to dedicated hardware. This is the move hash sharding can't make and directory sharding makes trivially.
Hard problem two: resharding
However you split today, the split is wrong eventually: a shard outgrows its hardware,
the shard count needs to double, the key choice itself was a mistake. Resharding means
moving a large fraction of the data while serving full traffic, and it's where the
scheme you chose either pays off or collects its debt. mod N hashing
collects hardest — bumping N reshuffles nearly everything — which is why consistent
hashing and pre-split bucket schemes exist: they bound how much data any topology change
moves.
The online migration itself has a standard shape, whatever the tooling: start copying the existing rows to the new layout; tail the source's replication log to apply the writes that happen during the copy; verify the two layouts match with checksums; then cut reads and writes over, ideally with a way back. Vitess' VReplication is this shape productised, MongoDB's balancer does it chunk by chunk continuously, and teams sharding a previously-unsharded Postgres by hand (Notion's migration write-up is the canonical tour) build the same pipeline out of logical replication and audit scripts. The cutover is the dangerous moment; everything before it should be boring, observable, and abortable.
Hard problem three: cross-shard transactions
Within one shard you still have everything the engine ever gave you — MVCC, real isolation levels, atomic multi-row commits. Across shards you have none of it by default. A transfer between two accounts on different shards is two databases that each see only half the transaction; making both halves commit or neither is two-phase commit, with its costs: an extra round of fsyncs and round trips, locks held across the network, and a coordinator whose crash leaves participants blocked mid-protocol. Atomicity is only half the problem, too — a reader scanning both shards between the two commits sees money in flight, and getting cross-shard isolation on top of cross-shard atomicity needs distributed snapshots or global ordering, which is the territory of the distributed SQL engines.
Sharded-but-not-distributed systems mostly route around the problem instead of solving it. The shard key is chosen so the transactions that matter are single-shard — co-locating a tenant's rows means a tenant's transactions are local, which is most of why Citus co-location and Vitess vindex design get so much attention. What can't be co-located gets eventual consistency on purpose: sagas with compensating actions, or outbox events a consumer applies to the other shard, with idempotent handlers because the events will be retried. Secondary indexes face the same fork: a local index (per shard) keeps writes single-shard but turns lookups on that column into scatter-gather; a global index (itself sharded by the indexed value) makes lookups one hop but makes every write a cross-shard write. There's no third option; there's only choosing which queries pay.
Further reading
- Vitess docs Vindexes, VReplication, and the resharding workflow — the best-documented routing layer in the business.
- Citus docs Distributed tables, reference tables, co-location, and the multi-tenant design guide.
- MongoDB docs — Sharding Shard keys, chunks, the balancer, and the operational caveats around key choice.
- Notion — Sharding Postgres at Notion A complete, honest write-up of sharding a live Postgres by hand: key choice, double-writes, verification, cutover.
- Kleppmann — DDIA, chapter 6 Partitioning strategies, rebalancing, secondary indexes, and request routing, in one chapter.
- Semicolony — Sharding (distributed systems) The abstract treatment: the same partitioning machinery across databases, queues, and caches.
- Semicolony — Database sharding simulator Watch keys route, shards heat up, and resharding move data, interactively.