ELI5 · Databases & storage

B-tree vs LSM-tree.

Filing each card neatly in place as it arrives, versus jotting everything in order and tidying up in batches later.

Underneath every database is a storage engine that decides how data lands on disk, and there are two main philosophies. A B-tree files each new record straight into its correct sorted place, updating the structure on disk right away — like a librarian re-shelving every returned book the instant it comes back.

An LSM-tree does the opposite. New writes are jotted quickly into memory and an append-only log, then flushed to disk in sorted batches, and tidied up (merged) in the background. It is like dumping returned books on a cart and re-shelving them in efficient sweeps later.

  1. 1

    A B-tree is a librarian who re-shelves every returned book the instant it comes back.

  2. Back and forth all day.
    scattered, random writes
    2

    Keeping the shelves sorted means lots of walking around — scattered, random writes.

  3. Right where it should be.
    straight to the spot
    3

    The payoff: reads are direct. Walk straight to the exact shelf and grab it.

  4. On the cart, sort it later.
    on the cart
    4

    An LSM-tree dumps returns on a cart and jots them down fast — quick, sequential writes.

  5. Now where did I put it…
    cart file 1 file 2 check several
    5

    A reader may now check the cart and several shelves — so reads can touch more places.

  6. merge sorted
    6

    In the background, compaction sweeps the carts onto sorted shelves and drops stale copies.

Re-shelving every book the instant it returns, versus a cart and batch sweeps later. (Bit runs the library.)

The write-vs-read trade

A B-tree pays at write time: keeping the on-disk structure sorted means scattered, random writes, which disks dislike. In return, reads are simple and fast — walk the tree straight to the row.

An LSM-tree pays at read time: because data is spread across an in-memory buffer and several sorted files on disk, a read may have to check several places (helped by Bloom filters). In return, writes are fast and sequential, which is why write-heavy systems favour it.

Compaction: the background tidy-up

An LSM-tree keeps producing new sorted files, so it runs compaction in the background: merging files, dropping deleted and overwritten entries, and keeping the number of files a reader must check down. This is what tames read cost over time, but it consumes disk I/O and CPU, and it means the same data may be rewritten several times (write amplification). Choosing between the two engines is largely about whether your workload is write-heavy or read-heavy.

The real version LSM-tree simulator →
Found this useful?