ELI5 · Databases & storage

Bloom filters.

A bouncer’s “definitely not on the list” memory: tiny, fast, and never wrongly says no.

A bouncer with a thousand-name guest list cannot carry the whole book. So imagine a tiny grid of light switches instead. Each name, run through a few stamps, flips a fixed set of switches on. To check a name, you stamp it and look: if any of its switches is off, that name was definitely never added.

A Bloom filter is exactly that trick. It answers “is this maybe in the set?” using a fraction of the memory a real list would need. It can occasionally say “maybe” for something that was never added, but it will never say “no” for something that was.

  1. All of it? No way.
    1

    A full guest list is huge — too big to carry or check quickly.

  2. just a few bits 000000
    2

    Keep a tiny grid of on/off switches instead — a few bits, not the whole book.

  3. "bit" stamps → 101001
    3

    To add a name, run it through a few stamps and flip those switches on.

  4. Not on the list.
    check "zed" 101 a 0 → definitely not in
    4

    To check: stamp the name. Any switch off? It was definitely never added.

  5. check "ben" 111 all 1 → maybe (could collide)
    5

    All switches on? It is probably in — but switches can collide, so “maybe.”

  6. 101 skip slow disk
    6

    Tiny and fast, never a false “no” — perfect for skipping expensive lookups.

A few bits and a few stamps stand in for a list too big to carry.

No false negatives, some false positives

The guarantee is one-directional. Because adding a name only ever turns switches on, a name that was truly added will always find all its switches on — so “no” is always trustworthy. But two different names can happen to share switches, so an item that was never added might find all of its switches already on by coincidence. That is a false positive: a “maybe” that turns out wrong.

Why that is still useful

Most real questions are “is this worth a slow lookup?” A database can keep a Bloom filter in memory and check it first: if it says “definitely not here,” the database skips a costly disk read entirely. The occasional false “maybe” just means a wasted check now and then — a tiny price for avoiding the expensive lookup the vast majority of the time.

The real version Bloom filter simulator →
Found this useful?