IV · Probabilistic structures

HyperLogLog

Count distinct, in 12 KB.

2007 · Flajolet, Fusy, Gandouet, Meunier · INRIA

The story

Counting unique elements in a stream of billions with sub-percent error in 12 KB of memory. The trick: hash each element; for each, count the leading zeros in the hash. The maximum leading-zero count seen is a stochastic estimator of the cardinality.

How it works

Partition the hash output into m buckets. For each bucket, track the maximum leading-zero count of all hashes that landed in it. The harmonic mean of 2^count across buckets, with bias correction, estimates the cardinality. Standard error ≈ 1.04/√m.

Where it lives

Redis (PFCOUNT, PFADD). BigQuery APPROX_COUNT_DISTINCT. Druid uniques. Snowflake APPROX_COUNT_DISTINCT. Every "unique users" metric at internet scale.

The key insight

Mergeable: HLL of region A union HLL of region B = HLL of A∪B, with no recomputation. That property is why HLLs power dashboards across sharded data — each shard returns its sketch, the gateway merges.

add	O(1)
cardinality	O(m)
merge	O(m)