Should you shard your database?

Q: Should you shard your database?

Not until one machine cannot keep up. Exhaust read replicas, indexes, caching, and a bigger box first. Sharding is the heaviest, most irreversible tool in the drawer. Sharding splits one logical database into many physical ones, each holding a slice of the rows, so no single machine has to hold all the data or absorb all the writes. It is the heaviest, most irreversible infrastructure work most teams ever do. The right time to reach for it is "we have run out of every cheaper option," not "we expect to grow."

Sharding solves exactly one problem

Sharding splits one logical database into many physical ones, each holding a slice of the rows, so no single machine has to hold all the data or absorb all the writes. It is the heaviest, most irreversible infrastructure work most teams ever do. The right time to reach for it is "we have run out of every cheaper option," not "we expect to grow."

The thing to internalise: sharding fixes a single primary node that cannot keep up with writes or cannot hold the dataset. That is the whole job. If that is not your problem, sharding will not help, and it will hand you a pile of new ones. So the first task is being precise about what is actually maxed out.

When it is time to shard

The honest signal is narrow. A single write primary saturated on throughput, or a dataset that no longer fits on the largest node you can buy — and a growth curve confirming one node will never be enough. You have already added replicas, indexes, and a cache, and bought a bigger box, and writes are still the wall. At that point sharding stops being premature and starts being necessary.

You also want a shard key that spreads both data and traffic evenly, where most queries can be answered from one shard. If your access pattern gives you that, sharding scales close to linearly. If it does not, fix the cheaper layers and wait.

When to exhaust the cheap runway first

Most "the database is slow" problems are reads, and reads have easy answers. Add a couple of indexes — a missing one is the difference between a sequential scan and a millisecond lookup. Put a cache in front of the hot queries. Add read replicas so reads spread across copies while the primary handles writes. That combination alone carries an enormous number of products their entire life.

Then the boring, underrated move: buy a bigger machine. Vertical scaling has a ceiling, but it is high — modern instances offer dozens of cores and hundreds of gigabytes of memory, and doubling your box is an afternoon versus a quarter of engineering. Skip sharding, too, when your queries constantly span all the data: global counts and cross-entity joins get slower and harder once the rows are scattered across shards.

What sharding actually costs: the shard key

Once you commit, everything hinges on the shard key — the field that decides which shard a row lives on. The failure mode is the hot shard. Shard by customer when one enterprise customer drives 40% of your traffic, and that customer's shard melts while the others sit idle. You scaled horizontally and still have a single overloaded node, only now it is harder to reason about.

The other failure is the fan-out query. If the key does not match how you read the data, common queries hit every shard and merge the results — slower and more fragile than the single box you started with. The cruelest part is that the key is effectively permanent. Change it once billions of rows are placed and you are resharding: physically moving data across shards while the system stays live, without dropping writes. Teams call that a multi-month project for a reason.

The cost math: count the machines

Sharding multiplies your instance count faster than your capacity. Each shard needs its own primary plus its own replicas, its own backups, its own monitoring — the full availability stack, per shard. Go from one shard to two and you have roughly doubled the fixed costs while each shard starts out half empty, so you are paying for headroom on every shard at once. Skew makes it worse: shards are usually sized uniformly, which means you provision all of them for the hottest one, and the cold shards carry the same bill at a fraction of the work.

Compare that honestly with the bigger box. Cloud pricing per core stays close to linear most of the way up the size chart, so doubling the instance roughly doubles the cost — about what a second shard costs, minus the engineer-months of migration, the routing layer, and the permanent operational surface. The migration is the real line item: a multi-month project of senior time costs more than years of running the larger instance.

The rule: price the next two instance sizes up before you price shards. If a bigger box exists and buys you a year, it is almost always cheaper than sharding, even when the sticker price looks steep. Sharding wins on cost only when you have run out of chart — no bigger box to buy — and at that point you are not choosing it to save money; you are choosing it because nothing else works.

The trap: sharding early to be safe

"Shard early so we never have to scramble" is exactly backwards. Sharding early means choosing the key when you understand your access patterns least, which is precisely when you are most likely to choose wrong. The resharding pain is really the cost of a bad key, paid much later when it is far more expensive to fix.

So the runway a bigger instance buys is almost always cheaper than the engineer-months sharding costs, and the extra time teaches you what your real query mix and skew look like. The longer you can responsibly wait, the better your key will be when you finally pick it.

Sharding vs read replicas (and a bigger box)

These solve different walls, and people conflate them. Read replicas scale reads: copies of the data serve queries while the primary takes writes. A bigger box scales everything vertically until you hit the hardware ceiling. Both are cheap, reversible, and an afternoon of work. Sharding is the only one that scales writes past a single primary — and it is expensive, near-permanent, and a real distributed system to operate.

So the order is fixed. Replicas and a cache for read load, a bigger instance for headroom, and sharding only when writes specifically saturate the largest single primary you can buy. If you cannot point at saturated writes, you want one of the first three, not the last one.

How to shard without regret

Pick a key that matches your dominant access pattern and spreads load, and validate it against your real query mix and your skew before committing, not after. The validation is the work; the migration is the easy part by comparison.

Lean on managed machinery rather than hand-rolling routing — a database that shards for you, or a layer like Vitess — because the operational surface is enormous. And migrate incrementally: move one slice, prove it, then continue, rather than flipping everything at once. Sharding is sometimes the right answer. It is just almost never the early one.