Sharding

Replication puts the same data on more machines; sharding puts different data on more machines. It's how a database outgrows one box's write throughput and one box's disk — and it's a one-way door. Once the rows are split by a key, every query either carries that key and goes to one shard, or doesn't and goes to all of them. Most of sharding practice is living with that sentence.

The setup

A single Postgres or MySQL primary takes you a long way — further than most teams expect, especially with read replicas absorbing the read traffic. But replication doesn't scale writes: every replica replays every write, so the write ceiling is still one machine's WAL, one machine's buffer pool, one machine's disks. When the working set outgrows RAM, or write volume outgrows one box, or a single table's indexes get too deep to maintain, the remaining move is to split the data itself. Each shard is an ordinary database holding a slice of the rows, usually with its own replica set behind it for HA — sharding and replication compose, they don't compete.

Two framings of the same idea, two pages on this site: the distributed-systems topic treats partitioning abstractly (it's the same mechanism inside Kafka, DynamoDB, and every distributed cache), and ELI5: sharding is the no-jargon version. This page is the database-operator's view: the strategies, the systems that implement them, and the three problems that take up all the time.

Everything downstream is determined by one decision: the shard key. The key decides which queries are single-shard (fast, scalable) and which are scatter-gather (touch every shard, scale with the slowest one). It decides whether load spreads evenly. And it's nearly immutable — changing it later means rewriting every row's placement. A good key has high cardinality, spreads writes evenly, and appears in the WHERE clause of the queries you care about. user_id or tenant_id usually qualifies; created_at usually doesn't.

Four ways to split a keyspace

Range. Each shard owns a contiguous interval of the key: [a, f), [f, m), and so on. Range scans stay on one shard and shards can split exactly where the data is dense — Bigtable's tablets, HBase regions, and the per-range Raft groups in CockroachDB and friends all work this way. The failure mode is built in: a monotonically increasing key (timestamps, sequential IDs) sends every insert to the last range, so one shard does all the writing while the rest watch.
Hash. Shard by hash(key). Adjacent keys scatter, so load spreads evenly and the monotonic-insert problem disappears — and so do range scans, which now touch every shard. The naive form, hash(key) mod N, has a trap: changing N remaps almost every key. Consistent hashing fixes that by making each node's arrival or departure move only its own slice (Cassandra's token ring with vnodes), and the pre-split form — fix a large number of hash buckets up front, assign buckets to shards — gets the same effect with simpler bookkeeping.
Directory. Don't compute the placement, store it: a lookup table maps each key (or each tenant) to its shard. Maximum flexibility — you can move one hot customer to their own hardware, place a regulated tenant in a specific region, rebalance one row at a time. The cost is that the directory is now on the path of every query: it has to be cached aggressively, replicated, and treated as tier-zero infrastructure. This is the standard shape for B2B SaaS sharded by tenant.
Geo. Partition by where the data legally or physically belongs: EU rows on EU shards. Sometimes this is latency engineering, increasingly it's compliance (data residency), and it composes with the other schemes — geo picks the region, hash or range places the row within it. Spanner's placement policies and CockroachDB's regional tables are this idea built into the database.

Three systems worth knowing

Vitess: sharded MySQL with a routing layer. Built at YouTube to scale MySQL, now a CNCF project and the engine under PlanetScale. The application talks to vtgate, a stateless proxy that speaks the MySQL protocol; behind it, each shard is a normal MySQL replica set managed by a vttablet sidecar. Routing is driven by vindexes — declared functions from column values to keyspace IDs, so the query layer can tell which shards a query touches and prune the rest. The headline feature is resharding via VReplication: new shards are populated by streaming the binlog from the old ones, traffic is cut over when they've caught up, and the switch is reversible until you commit to it. Slack, GitHub, and Square have all written publicly about running it.

Citus: sharding inside Postgres. A Postgres extension (now owned by Microsoft, open source) rather than a proxy. You pick a distribution column and call create_distributed_table(); the coordinator keeps metadata and the rows live in many shards spread across worker nodes. The planner intercepts queries: a query filtered on the distribution column routes to one worker, an analytical query fans out and aggregates partial results. Two ideas do most of the work in practice: reference tables (small lookup tables copied in full to every worker, so joins against them stay local) and co-location (tables distributed by the same column have matching rows on the same worker, so tenant-scoped joins never cross the network). The natural fit is multi-tenant SaaS sharded by tenant_id.

MongoDB: chunks and a balancer. The built-in version of the same architecture: mongos routers in front, config servers holding the metadata, and each shard a replica set. A sharded collection is divided into chunks — contiguous shard-key ranges — and a background balancer migrates chunks between shards when the distribution gets uneven. The shard key supports hashed or ranged flavours, and its historical rigidity has softened: recent versions can refine a key (add suffix fields) and reshard a collection in place. The operational lesson MongoDB's docs spend the most ink on is the same one as everywhere else: a bad shard key, not a bad balancer, is what sinks deployments.

System	Unit of placement	Routing	Resharding
Vitess	Keyspace-ID ranges over MySQL shards	`vtgate` proxy + vindexes	VReplication: binlog copy, verified cutover
Citus	Hash buckets of the distribution column	Postgres planner on the coordinator	Online shard rebalancer moves shards between workers
MongoDB	Chunks (shard-key ranges)	`mongos` + config servers	Automatic balancer; in-place resharding in recent versions
App-level (directory)	Whatever the lookup table says	Your code + a cached directory service	Yours to build — usually dual-write + backfill + cutover

Hard problem one: the hot shard

Hashing spreads keys evenly, not load. If one key is orders of magnitude hotter than the rest — the celebrity account, the viral post, the one enterprise tenant who is half your traffic — its shard runs hot no matter how the others are balanced, and adding shards doesn't help because the hot key still lives on exactly one of them.

Pick a key that can't concentrate. The cheapest fix is upstream: shard messages by (channel_id) and one giant channel melts a shard; shard by (channel_id, bucket) and it can't.
Salt the hot keys. Append a small random suffix to the hottest keys so their writes spread over several shards, at the price of fanning out reads for those keys to reassemble them. Usually applied only to detected-hot keys, not globally.
Cache in front. A read-hot key is often better absorbed by a cache tier than by resharding the database under it.
Isolate. With a directory scheme, move the hot tenant to dedicated hardware. This is the move hash sharding can't make and directory sharding makes trivially.

Skew is the default, not the exception. Real-world key popularity is heavy-tailed — a few keys dominate. Design reviews for sharded systems should ask "what happens when one key is most of the load" the way they ask "what happens when a node dies", because both are certainties, not edge cases.

Hard problem two: resharding

However you split today, the split is wrong eventually: a shard outgrows its hardware, the shard count needs to double, the key choice itself was a mistake. Resharding means moving a large fraction of the data while serving full traffic, and it's where the scheme you chose either pays off or collects its debt. mod N hashing collects hardest — bumping N reshuffles nearly everything — which is why consistent hashing and pre-split bucket schemes exist: they bound how much data any topology change moves.

The online migration itself has a standard shape, whatever the tooling: start copying the existing rows to the new layout; tail the source's replication log to apply the writes that happen during the copy; verify the two layouts match with checksums; then cut reads and writes over, ideally with a way back. Vitess' VReplication is this shape productised, MongoDB's balancer does it chunk by chunk continuously, and teams sharding a previously-unsharded Postgres by hand (Notion's migration write-up is the canonical tour) build the same pipeline out of logical replication and audit scripts. The cutover is the dangerous moment; everything before it should be boring, observable, and abortable.

Hard problem three: cross-shard transactions

Within one shard you still have everything the engine ever gave you — MVCC, real isolation levels, atomic multi-row commits. Across shards you have none of it by default. A transfer between two accounts on different shards is two databases that each see only half the transaction; making both halves commit or neither is two-phase commit, with its costs: an extra round of fsyncs and round trips, locks held across the network, and a coordinator whose crash leaves participants blocked mid-protocol. Atomicity is only half the problem, too — a reader scanning both shards between the two commits sees money in flight, and getting cross-shard isolation on top of cross-shard atomicity needs distributed snapshots or global ordering, which is the territory of the distributed SQL engines.

Sharded-but-not-distributed systems mostly route around the problem instead of solving it. The shard key is chosen so the transactions that matter are single-shard — co-locating a tenant's rows means a tenant's transactions are local, which is most of why Citus co-location and Vitess vindex design get so much attention. What can't be co-located gets eventual consistency on purpose: sagas with compensating actions, or outbox events a consumer applies to the other shard, with idempotent handlers because the events will be retried. Secondary indexes face the same fork: a local index (per shard) keeps writes single-shard but turns lookups on that column into scatter-gather; a global index (itself sharded by the indexed value) makes lookups one hop but makes every write a cross-shard write. There's no third option; there's only choosing which queries pay.

Sharding

The setup

Four ways to split a keyspace

Three systems worth knowing

Hard problem one: the hot shard

Hard problem two: resharding

Hard problem three: cross-shard transactions

Further reading

09 — Distributed SQL