B-tree vs LSM-tree.
Filing each card neatly in place as it arrives, versus jotting everything in order and tidying up in batches later.
Underneath every database is a storage engine that decides how data lands on disk, and there are two main philosophies. A B-tree files each new record straight into its correct sorted place, updating the structure on disk right away — like a librarian re-shelving every returned book the instant it comes back.
An LSM-tree does the opposite. New writes are jotted quickly into memory and an append-only log, then flushed to disk in sorted batches, and tidied up (merged) in the background. It is like dumping returned books on a cart and re-shelving them in efficient sweeps later.
- 1
A B-tree is a librarian who re-shelves every returned book the instant it comes back.
- Back and forth all day.2
Keeping the shelves sorted means lots of walking around — scattered, random writes.
- Right where it should be.3
The payoff: reads are direct. Walk straight to the exact shelf and grab it.
- On the cart, sort it later.4
An LSM-tree dumps returns on a cart and jots them down fast — quick, sequential writes.
- Now where did I put it…5
A reader may now check the cart and several shelves — so reads can touch more places.
- 6
In the background, compaction sweeps the carts onto sorted shelves and drops stale copies.
The write-vs-read trade
A B-tree pays at write time: keeping the on-disk structure sorted means scattered, random writes, which disks dislike. In return, reads are simple and fast — walk the tree straight to the row.
An LSM-tree pays at read time: because data is spread across an in-memory buffer and several sorted files on disk, a read may have to check several places (helped by Bloom filters). In return, writes are fast and sequential, which is why write-heavy systems favour it.
Compaction: the background tidy-up
An LSM-tree keeps producing new sorted files, so it runs compaction in the background: merging files, dropping deleted and overwritten entries, and keeping the number of files a reader must check down. This is what tames read cost over time, but it consumes disk I/O and CPU, and it means the same data may be rewritten several times (write amplification). Choosing between the two engines is largely about whether your workload is write-heavy or read-heavy.