IV · Probabilistic structures

Bloom filter

A set, in a thousandth of the space.

The story

Bloom's paper "Space/Time Trade-offs in Hash Coding with Allowable Errors" proposed a tiny structure that says, with certainty, "definitely not in the set" or "probably in the set". The trick: use a few hash functions to set bits in an array; check by checking if all those bits are set. False positives are tunable; false negatives are impossible.

How it works

Bit array of size m; k hash functions. Insert: set the k bits given by hashing the key. Lookup: check if all k bits are set. False-positive rate ≈ (1 − e^(−kn/m))^k. Optimal k = (m/n) ln 2.

Where it lives

Cassandra row caches. Bitcoin SPV nodes. Chrome's URL safe-browsing list. Bigtable / LevelDB / RocksDB SSTable filters — checking if a key might be in a sorted file before opening it. Every modern LSM-based KV store.

The key insight

Bloom filters live in front of expensive lookups. The 1% false-positive cost is paid in vain disk reads; the 99% true-negative saves them. The structure exists because disk is slow.