IV · Probabilistic structures

MinHash

Set similarity, in a few hashes.

The story

Broder needed to detect near-duplicate web pages at AltaVista. Computing Jaccard similarity (|A∩B|/|A∪B|) for billions of pages was infeasible. His insight: hash each set's elements with k different functions; keep the minimum hash from each. The fraction of matching minimums is an unbiased estimator of Jaccard similarity.

How it works

Pick k hash functions. For each set, compute h(x) for every element x and keep the minimum: that's the set's "signature" under that function. Across k functions, you get a k-dimensional signature. Comparing two signatures: count matching positions, divide by k.

Where it lives

Web deduplication (Google, Bing). Netflix similarity recommendations. datasketches Apache library. Locality-Sensitive Hashing pipelines.

The key insight

MinHash + Locality-Sensitive Hashing turns "find all pairs with Jaccard > 0.7" from O(n²) into O(n) — feasible at web scale, infeasible without.