VI · Graphs & sets

LSM-tree

Append-only, merged in the background.

The story

Patrick O'Neil et al noticed that disk is fast at sequential writes and slow at random ones. Their LSM-tree (Log-Structured Merge) writes everything sequentially to a small in-memory buffer, flushes it sorted to disk, and merges those sorted runs in the background. The pattern powers every modern key-value store.

How it works

In-memory memtable buffers writes. When full, it's flushed to a sorted SSTable file. SSTables are merged in tiered or levelled compactions to limit the number of files a read must check. Bloom filters in front of each SSTable cut the read amplification.

Where it lives

RocksDB (and everyone using it: MyRocks, CockroachDB, TiKV, Yugabyte). LevelDB. Cassandra. ScyllaDB. HBase. BigTable. ClickHouse. Pebble.

The key insight

LSM-trees trade write amplification (10–30× the logical write) for sequential I/O — the trade-off SSDs love. Read amplification (checking many levels) is the cost; bloom filters and compactions are how you manage it.