Embeddings & vector search

An embedding turns a piece of text into a point in space, placed so that things which mean similar things sit close together. Once meaning is geometry, "find related documents" stops being a keyword-matching problem and becomes "find the nearest points." The only thing left to solve is how to do that fast over hundreds of millions of vectors — which is where the real engineering, and this page, lives.

What an embedding is

An embedding is a fixed-length list of numbers, a vector, that represents a piece of content. A modern text embedding model takes a sentence or a paragraph and returns, say, 768 or 1,536 numbers. On their own those numbers mean nothing. What matters is their position relative to other vectors: the model is trained so that text with similar meaning produces vectors that are close together, and unrelated text produces vectors that are far apart. The dimensionality is a design choice: more dimensions can capture finer distinctions but cost more to store and compare, so production models cluster around a few hundred to a few thousand.

This is the same idea as the token embeddings inside an LLM, scaled up to whole passages. "How do I reset my password" and "I forgot my login" share almost no words but land near each other, because the model learned that they are used in the same situations. That is the entire trick: meaning becomes distance. Search, clustering, deduplication, recommendation, and the retrieval half of RAG are all the same operation underneath — measure distance between vectors — applied to different problems.

How embeddings are produced

An embedding model is usually a transformer, often a smaller relative of a generative model, trained specifically so that its output vector captures the meaning of the whole input rather than predicting the next token. The common training recipe is contrastive: show the model pairs of texts that should be close (a question and its answer, a sentence and its paraphrase) and pairs that should be far apart, and adjust it until similar pairs land near each other and dissimilar pairs spread out. Because the model is producing one vector for a whole passage, the choice of model matters: a model trained on short search queries behaves differently from one trained on long documents or on code.
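The contrastive recipe can be made concrete with a small loss function. This is a minimal sketch of one common formulation, a softmax over in-batch similarities (often called InfoNCE); the page does not name a specific loss, so the function name `info_nce_loss`, the `temperature` value, and the toy vectors are all illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch contrastive loss (assumed formulation).

    Row i of `queries` should match row i of `positives`; every other
    row in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Similarity of this query to every candidate, sharpened by temperature.
        logits = [dot(q, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability assigned to the correct pair.
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy unit vectors: matching pairs give a low loss, swapped pairs a high one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

Training nudges the model's weights in whatever direction lowers this number, which is what pulls similar pairs together and pushes the rest of the batch apart.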

Two practical rules follow. First, you must embed your stored documents and your queries with the same model, because two models produce coordinates in incompatible spaces — a vector from one is meaningless to another. Second, changing the embedding model later means re-embedding everything you have stored, which is a real migration cost worth planning for up front. Pick a model, measure it on your own data, and treat a switch as a reindex, not a config change.

Choosing an embedding model

The model is the single biggest lever on quality, and a few dimensions of choice matter in practice. Vector size trades accuracy for cost: larger vectors can encode finer distinctions but cost proportionally more to store and compare, and many modern models support shortening their output so you can dial the size down when storage dominates. Maximum input length decides how much text one vector can faithfully represent; a model capped at 512 tokens will quietly truncate a long document and embed only the start. Domain matters more than people expect — a model trained on web prose may do poorly on source code, legal text, or biomedical terms, where a specialised model wins outright.
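To make the size trade concrete, here is a rough sketch of shortening a vector and of the storage arithmetic. It assumes the model was trained so that a prefix of its output is still a usable embedding; cutting down the output of a model that was not trained this way will generally hurt quality. The helper names `shorten` and `storage_bytes` and the 4-bytes-per-value figure (float32) are assumptions for illustration.

```python
import math


def shorten(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalise an all-zero vector")
    return [x / norm for x in head]


def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, ignoring index overhead (float32 assumed)."""
    return num_vectors * dims * bytes_per_value


full = [0.6, 0.0, 0.8, 0.0]
short = shorten(full, 2)

# Ten million vectors: 1,536 dims versus a shortened 256 dims.
full_size = storage_bytes(10_000_000, 1536)   # 61,440,000,000 bytes
short_size = storage_bytes(10_000_000, 256)   # 10,240,000,000 bytes
```

Whether the smaller vectors are good enough is an empirical question for your own data, not something the arithmetic can answer.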

Then there is the build-versus-buy axis. A hosted embedding API is the fastest start and offloads the GPUs, but every document and query leaves your network and you pay per call forever. A self-hosted open model keeps data in-house and makes re-embedding cheap once you own the hardware, at the cost of running it. Public leaderboards such as MTEB are a reasonable starting shortlist, but the only benchmark that counts is your own data: take a few hundred real queries with known good answers, embed with two or three candidate models, and measure which retrieves the right documents. Treat the result as a decision you will live with, because, as noted above, changing models later means re-embedding everything.
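The "measure it on your own data" step can be sketched as a small harness. Everything here is a stand-in: `hit_rate_at_k` is a hypothetical helper, and the two toy "models" are deliberately trivial functions so the example runs without any real embedding service. In practice `embed` would call whichever candidate model you are testing.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k):
    """Ids of the k documents with the highest dot product to the query."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def hit_rate_at_k(embed, documents, labelled_queries, k=3):
    """Fraction of queries whose known-good document appears in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in documents.items()}
    hits = 0
    for query_text, good_doc_id in labelled_queries:
        if good_doc_id in top_k(embed(query_text), doc_vecs, k):
            hits += 1
    return hits / len(labelled_queries)


# Toy stand-ins for two candidate models (not real embedding models).
VOCAB = ["password", "login", "invoice", "refund"]


def toy_model_a(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_model_b(text):
    # Deliberately poor: only looks at the length of the text.
    return [float(len(text)), 1.0]


documents = {
    "d1": "reset your password or login",
    "d2": "request a refund for an invoice",
}
labelled_queries = [("forgot password", "d1"), ("refund please", "d2")]

score_a = hit_rate_at_k(toy_model_a, documents, labelled_queries, k=1)
score_b = hit_rate_at_k(toy_model_b, documents, labelled_queries, k=1)
```

The shape of the comparison is the point: the same documents, the same labelled queries, one number per candidate model.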

Measuring closeness

Two vectors are "close" if the angle between them is small, which is what cosine similarity measures. In practice teams normalise every vector to unit length, after which a plain dot product equals the cosine and is cheaper to compute because the division by the vector lengths disappears. Euclidean (straight-line) distance is also used, and on normalised vectors it ranks results identically to cosine. The headline point is that similarity is just arithmetic on the numbers: no understanding is needed at query time, which is exactly why it is fast enough to serve.
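A short check of that equivalence, using made-up three-dimensional vectors in place of real embeddings. The reason the rankings agree is the identity |a - b|^2 = 2 - 2(a . b) for unit-length a and b, so a larger dot product always means a smaller Euclidean distance.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative vectors, normalised once up front.
query = normalise([1.0, 2.0, 0.5])
docs = {
    "a": normalise([1.0, 2.0, 0.4]),
    "b": normalise([2.0, 1.0, 0.0]),
    "c": normalise([-1.0, 0.0, 3.0]),
}

by_cosine = sorted(docs, key=lambda n: cosine(query, docs[n]), reverse=True)
by_dot = sorted(docs, key=lambda n: dot(query, docs[n]), reverse=True)
by_euclid = sorted(docs, key=lambda n: euclidean(query, docs[n]))  # smaller is closer
```

All three lists come out in the same order, which is why a store can pick whichever metric is cheapest once the vectors are normalised.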

Similarity is the angle between vectors. Normalise to unit length and a dot product ranks results the same way, faster.

Embeddings are not keywords. Keyword search finds documents that contain your words. Vector search finds documents that mean the same thing, even with zero shared words. That is its strength and its weakness: it nails paraphrase and synonyms, but can miss an exact code, SKU, or error string. The fix is usually hybrid search — run both keyword and vector search and combine the scores — which is why most serious systems are hybrid, not pure vector.
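One common way to combine the two result lists is reciprocal rank fusion, sketched below. Note the swap: it merges by rank position rather than by raw score, which sidesteps the problem that keyword scores and vector similarities live on different scales. The constant `k=60` is a conventional default, and the document ids are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first id lists into one.

    Each list contributes 1 / (k + rank) for every id it contains, so an
    item ranked well by both searches beats one ranked well by only one.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["sku-4471-manual", "returns-policy", "shipping-faq"]
vector_hits = ["returns-policy", "refund-howto", "sku-4471-manual"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists rise to the top, while an exact-match hit that only keyword search found still makes the final list.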

The scaling problem

Finding the closest vector to a query is easy if you compare against every stored vector: compute the distance to all of them and keep the smallest. This brute-force scan is exact, and it is completely fine up to a few hundred thousand vectors. The trouble is that it is linear in the number of vectors and in the dimensionality. At a hundred million documents with 1,000-dim vectors, every query touches a hundred billion numbers — far too slow and too expensive to serve interactively, and it gets worse with every document you add.
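The brute-force scan is short enough to write out in full. A minimal sketch, assuming unit-length vectors scored by dot product; `numbers_touched` is a hypothetical helper that just reproduces the arithmetic in the paragraph above.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, vectors, k):
    """Exact search: score every stored vector and keep the k best indexes."""
    return heapq.nlargest(k, range(len(vectors)), key=lambda i: dot(query, vectors[i]))


def numbers_touched(num_vectors, dims):
    """Values read per query by a full scan: linear in both factors."""
    return num_vectors * dims


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
nearest = brute_force_top_k([1.0, 0.1], vectors, k=2)

per_query = numbers_touched(100_000_000, 1_000)  # one hundred billion
```

Nothing here is approximate, which is also why the same function is useful later as the ground truth an ANN index gets measured against.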

Brute force is exact and linear. ANN trades a small, tunable chance of a miss for examining only a fraction of the data.

So production systems give up a little accuracy for a lot of speed. Approximate nearest-neighbour (ANN) search returns vectors that are almost certainly among the closest, in sub-linear time (roughly logarithmic for graph indexes) instead of linear, by building an index that lets the search skip the vast majority of candidates. You trade a small, tunable chance of missing the true best match for orders-of-magnitude faster queries. Every vector database is, at heart, an implementation of one or more ANN indexes plus the storage and filtering around them.

HNSW: the graph index

Hierarchical Navigable Small World graphs are the most common default, because they give excellent recall at low latency without much tuning. The idea borrows from how you would find a house in an unfamiliar country: start with a coarse map of major cities, hop to the nearest one, then zoom into a regional map, then a street map. HNSW builds exactly that as layers of a graph. The top layer has a few nodes with long-range links; each layer down is denser with shorter links; the bottom layer holds every vector.

HNSW search: enter at the sparse top layer, greedily hop toward the query, drop a layer, refine, repeat — touching a tiny fraction of nodes.

A search enters at the top, greedily moves to whichever neighbour is closest to the query, and when it cannot get closer it drops to the next layer down and continues. The long-range links up top cover huge distance in a few hops; the dense links at the bottom pin down the exact neighbourhood. Two knobs govern it: M, how many links each node keeps (higher means better recall and more memory), and ef_search, how many candidates to keep in play during a query (higher means better recall and slower queries). HNSW's costs are memory and slower inserts; its payoff is fast, high-recall search with little tuning, which is why pgvector, Qdrant, Weaviate, and Lucene all offer it.
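The search procedure can be sketched in a few dozen lines. This is a simplified illustration, not a faithful HNSW implementation: the real algorithm inserts nodes one at a time and assigns layers probabilistically, whereas `build_layers` below is a hypothetical brute-force build that exists only so the search has a graph to walk. Points on a grid stand in for embeddings so the result is easy to check by eye.

```python
import heapq
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_layers(vectors, M=4, num_layers=3, seed=0):
    """Toy build: link each node to its M nearest neighbours in its layer,
    then keep a random quarter of the nodes for the layer above."""
    rng = random.Random(seed)
    members = list(range(len(vectors)))
    layers = []
    for _ in range(num_layers):
        graph = {i: set() for i in members}
        for i in members:
            others = sorted((j for j in members if j != i),
                            key=lambda j: dist(vectors[i], vectors[j]))
            for j in others[:M]:
                graph[i].add(j)
                graph[j].add(i)  # links are kept bidirectional
        layers.append(graph)
        members = rng.sample(members, max(1, len(members) // 4))
    return layers[::-1]  # sparse top layer first, full bottom layer last


def search_layer(vectors, graph, query, entry_points, ef):
    """Best-first search in one layer, keeping the ef closest nodes seen."""
    visited = set(entry_points)
    candidates = [(dist(query, vectors[i]), i) for i in entry_points]
    heapq.heapify(candidates)                  # min-heap: closest first
    best = [(-d, i) for d, i in candidates]
    heapq.heapify(best)                        # max-heap via negated distance
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # nothing left that could improve the kept results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, vectors[nb])
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)        # drop the current worst
    return sorted((-neg_d, i) for neg_d, i in best)


def hnsw_search(vectors, layers, query, k, ef_search=16):
    entry = [next(iter(layers[0]))]
    for graph in layers[:-1]:
        # Upper layers: greedy descent, carrying one entry point down.
        entry = [search_layer(vectors, graph, query, entry, ef=1)[0][1]]
    found = search_layer(vectors, layers[-1], query, entry, ef=max(ef_search, k))
    return [i for _, i in found[:k]]


vectors = [(float(x), float(y)) for x in range(12) for y in range(12)]
layers = build_layers(vectors)
query = (3.3, 7.6)
found = hnsw_search(vectors, layers, query, k=5, ef_search=16)
exact = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:5]
```

On this tidy grid the graph walk returns the same five points as the exact scan. On real, high-dimensional data it usually will not match perfectly, and raising `ef_search` is how you buy back the difference.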

IVF and quantization: the clustering family

The other big idea is to cluster the vectors first. An inverted file (IVF) index runs k-means once to carve the space into, say, a few thousand cells, each with a representative centroid. At query time you compare the query to the centroids, pick the nearest few cells, and only scan the vectors inside them. The knob is nprobe: how many cells to open. Open one and the search is blazing fast but may miss matches that fell just over a cell boundary; open more and recall climbs as latency rises.
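A compact sketch of the IVF idea, under stated assumptions: a plain k-means with a fixed number of iterations, random Gaussian vectors standing in for embeddings, and a hypothetical `IVFIndex` class. Production libraries train on a sample, balance cells, and store vectors far more compactly.

```python
import random


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_cells, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_cells)]
    for _ in range(iterations):
        buckets = [[] for _ in range(n_cells)]
        for v in vectors:
            nearest = min(range(n_cells), key=lambda c: dist2(v, centroids[c]))
            buckets[nearest].append(v)
        for c, bucket in enumerate(buckets):
            if bucket:  # an empty cell keeps its previous centroid
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


class IVFIndex:
    def __init__(self, vectors, n_cells, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, n_cells, seed=seed)
        self.cells = [[] for _ in range(n_cells)]
        for i, v in enumerate(vectors):
            cell = min(range(n_cells), key=lambda c: dist2(v, self.centroids[c]))
            self.cells[cell].append(i)

    def search(self, query, k, nprobe=1):
        # Rank cells by centroid distance, then scan only the nearest nprobe.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dist2(query, self.centroids[c]))
        candidates = [i for c in order[:nprobe] for i in self.cells[c]]
        candidates.sort(key=lambda i: dist2(query, self.vectors[i]))
        return candidates[:k]


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
query = [rng.gauss(0, 1) for _ in range(4)]

index = IVFIndex(vectors, n_cells=8)
fast = index.search(query, k=5, nprobe=1)      # one cell: quick, may miss
thorough = index.search(query, k=5, nprobe=8)  # every cell: same as a full scan
exact = sorted(range(len(vectors)), key=lambda i: dist2(query, vectors[i]))[:5]
```

Opening every cell degenerates into the brute-force scan, which makes the `nprobe` dial easy to reason about: it slides between "one cell" and "exact".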

Clustering pairs naturally with quantization, which compresses each vector so far more of the index fits in memory. Product quantization splits a vector into chunks and replaces each chunk with the id of the nearest entry in a small learned codebook, shrinking a vector from kilobytes to a few dozen bytes. Distances are then computed on the compressed codes. The trade is the same one the whole serving stack makes everywhere: accuracy for memory. IVF plus product quantization is the workhorse for very large, mostly static datasets where fitting the index in RAM is the binding constraint.
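Product quantization is easier to follow in code than in prose. The sketch below assumes a small k-means per chunk and random vectors as stand-ins; the helper names (`train_codebooks`, `encode`, `decode`, `approx_dist2`) are illustrative, not any library's API. For scale: a 768-dimension float32 vector is 3,072 bytes, while 64 chunks with 256-entry codebooks need 64 one-byte ids, roughly a 48x reduction.

```python
import random


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=8, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda c: dist2(p, centroids[c]))].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def split(vector, n_chunks):
    size = len(vector) // n_chunks
    return [vector[m * size:(m + 1) * size] for m in range(n_chunks)]


def train_codebooks(vectors, n_chunks, codebook_size):
    # Group all first chunks together, all second chunks together, and so on.
    columns = zip(*(split(v, n_chunks) for v in vectors))
    return [kmeans(list(chunks), codebook_size, seed=m)
            for m, chunks in enumerate(columns)]


def encode(vector, codebooks):
    """Replace each chunk with the id of its nearest codebook entry."""
    return [min(range(len(book)), key=lambda j: dist2(part, book[j]))
            for part, book in zip(split(vector, len(codebooks)), codebooks)]


def decode(code, codebooks):
    """Rebuild the approximate vector a code stands for."""
    return [x for book, j in zip(codebooks, code) for x in book[j]]


def approx_dist2(query, code, codebooks):
    """Squared distance from an uncompressed query to a compressed vector."""
    parts = split(query, len(codebooks))
    return sum(dist2(part, book[j]) for part, book, j in zip(parts, codebooks, code))


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(vectors, n_chunks=4, codebook_size=16)
codes = [encode(v, codebooks) for v in vectors]  # four small ids per vector
```

The stored form of each vector is now just `codes[i]`; the original floats can be thrown away, which is where both the memory saving and the accuracy loss come from.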

Family	Idea	Tune with	Best when
HNSW (graph)	Greedy hops down a multi-layer small-world graph	`M`, `ef_search`	default; high recall, low latency, incremental inserts without retraining
IVF (clusters)	Scan only the nearest few cells	`nprobe`, cell count	huge, mostly-static datasets
+ Product quantization	Compress vectors to compact codes	code size	index must fit in limited RAM

You can build the HNSW intuition hands-on in the vector embedding simulator, and the same index types show up as a row in the database index types deep dive.

Recall, the metric that actually matters

Because ANN is approximate, the number you watch is recall: of the true nearest neighbours, what fraction did the index return? Every tuning knob trades recall against latency. Raise HNSW's ef_search or IVF's nprobe and the search examines more candidates, so recall climbs and queries slow down. The right operating point is workload specific — a recommendation feed can live happily at 0.9 recall, a legal-document search probably cannot — and you find it by measuring against a brute-force ground truth on a sample, not by guessing. A useful habit is to compute exact nearest neighbours for a few hundred queries offline, then tune the index until its recall against that ground truth clears your bar at acceptable latency.
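The measurement itself is only a few lines. A sketch, with hypothetical helper names; the recall figures in the tuning example are invented to show the loop, not measurements of any real index.

```python
def recall_at_k(approx_ids, exact_ids):
    """Of the true nearest neighbours, what fraction did the index return?"""
    truth = set(exact_ids)
    return len(truth & set(approx_ids)) / len(truth)


def mean_recall(approx_results, exact_results):
    """Average recall over a set of sample queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e) for a, e in pairs) / len(pairs)


def pick_setting(settings, measure_recall, target):
    """Smallest (cheapest) knob value whose measured recall clears the bar."""
    for value in sorted(settings):
        if measure_recall(value) >= target:
            return value
    return None  # nothing tested was good enough


single = recall_at_k([1, 2, 3, 9], [1, 2, 3, 4])
average = mean_recall([[1, 2], [3, 4]], [[1, 2], [3, 5]])

# Made-up recall per ef_search value, standing in for real measurements.
measured = {16: 0.82, 32: 0.93, 64: 0.97}
chosen = pick_setting(measured, measured.get, target=0.95)
```

In a real run `measure_recall` would query the index at that setting and compare against the brute-force answers you computed offline.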

Filtering, chunking, and where to store

Two operational details bite teams more than the index choice. The first is filtered search: "find similar documents where tenant_id = 42" is harder than it looks, because the index can either over-fetch and then filter (wasting work and risking too few results) or filter first and then search only the surviving vectors, which is exact but slow as a brute-force scan and can wreck recall when a graph index has to route around the nodes the filter removed. How a store handles filtered vector search is often the real deciding factor between products. The second is chunking: you rarely embed whole documents, you embed passages, and chunk size is a genuine trade-off, since small chunks lose context and large chunks blur the meaning of the embedding. That choice belongs to the RAG pipeline and is covered there.
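The over-fetch strategy looks roughly like this. It is a sketch of the general pattern rather than how any particular store implements it: `search` is a hypothetical callable that returns ids best-first, `allowed` is the filter predicate, and the `overfetch` and `max_fetch` defaults are arbitrary.

```python
def filtered_search(search, allowed, k, overfetch=4, max_fetch=10_000):
    """Post-filter: fetch more than k, drop rows that fail the predicate,
    and widen the request if too few survive."""
    fetch = k * overfetch
    while True:
        hits = [doc_id for doc_id in search(fetch) if allowed(doc_id)]
        if len(hits) >= k or fetch >= max_fetch:
            return hits[:k]
        fetch = min(fetch * 2, max_fetch)  # wasted work: the search reruns


# Toy setup: ids 0..99 already in similarity order, ten tenants.
ranked = list(range(100))
tenant_of = {i: i % 10 for i in ranked}
calls = []


def search(n):
    calls.append(n)
    return ranked[:n]


results = filtered_search(search, lambda i: tenant_of[i] == 7, k=3)
```

With a filter that matches one row in ten, the first two fetches come back short and the search has to run three times, which is the "wasting work" cost in miniature.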

On storage, you do not always need a dedicated vector database. If you already run Postgres and your scale is moderate, the pgvector extension keeps vectors next to your relational data, so you can filter by ordinary columns and join in one query. A dedicated store (Pinecone, Qdrant, Weaviate, Milvus) earns its place when scale, filtering throughput, or zero-ops management justify a second system. The trade-offs are laid out in the pgvector vs Pinecone vs Weaviate comparison. The honest default for most teams is: start in the database you already run, and move to a dedicated store only when you can name the limit you have hit.

Why the geometry works at all

It is worth pausing on why turning meaning into distance is even possible. The model is never told a dictionary of meanings; it is only shown which texts go together. But "goes together" is a rich enough signal that, repeated over billions of examples, it forces a consistent geometry to emerge. If questions tend to sit near their answers, and paraphrases near their originals, and translations near their source, the only way to satisfy all those constraints at once is to arrange the space so that meaning itself becomes the organising axis. The directions that fall out — sentiment, topic, formality, tense — are side effects of solving the "what-goes-with-what" problem at scale, not features anyone designed.

This also explains the limits. The geometry only reflects the relationships present in the training data. A model that never saw your internal product names will scatter them more or less at random, which is why domain-specific search sometimes needs a fine-tuned embedding model rather than a general one. And high-dimensional spaces behave unintuitively: as dimensionality grows, distances between random points bunch together, so the gap between "the nearest" and "the tenth nearest" can be small. That is the curse of dimensionality, and it is part of why recall is something you measure rather than assume, and why re-ranking a handful of top candidates with a more careful model often pays off.
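The bunching effect is easy to see with random points. This sketch uses uniformly random vectors, which are not real embeddings (trained embeddings have far more structure), so it illustrates the geometric tendency only. `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest, measured from one random query.

    A large value means the nearest point stands out; a value near zero
    means every point is at almost the same distance.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        distances.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, point))))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


contrast_by_dims = {d: relative_contrast(d) for d in (2, 32, 512)}
```

The contrast shrinks as the dimension grows, so small ranking errors become more likely and a re-ranking pass over the top candidates has more to fix.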

Beyond search: the other jobs embeddings do

Search is the headline use, but the same vectors quietly power several other features, and recognising them as one operation — measure distance — keeps your architecture simple. Clustering groups similar items by finding dense regions of the space, which is how you auto-group support tickets by theme or surface emerging topics in feedback. Deduplication flags near-identical content by looking for vectors that sit almost on top of each other, catching reposts and paraphrased copies that exact-match checks miss. Recommendation treats "more like this" as "nearest neighbours of the thing you just engaged with." Classification can be as simple as embedding a few labelled examples per class and assigning new items to whichever class centroid is closest, no separate model required.
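Two of those jobs, sketched with toy two-dimensional vectors in place of real embeddings. The function names, labels, and the 0.95 duplicate threshold are illustrative assumptions; a sensible threshold depends on the model and the content.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def centroid(vectors):
    """Mean of the vectors, rescaled to unit length."""
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return normalise(mean)


def classify(vector, class_centroids):
    """Assign the label whose centroid is most similar to the vector."""
    v = normalise(vector)
    return max(class_centroids, key=lambda label: dot(v, class_centroids[label]))


def near_duplicates(vectors, threshold=0.95):
    """Index pairs whose cosine similarity is at or above the threshold."""
    unit = [normalise(v) for v in vectors]
    return [(i, j)
            for i in range(len(unit))
            for j in range(i + 1, len(unit))
            if dot(unit[i], unit[j]) >= threshold]


# A few labelled examples per class are enough to form centroids.
labelled = {
    "billing": [[1.0, 0.1], [0.9, 0.2]],
    "login": [[0.1, 1.0], [0.2, 0.9]],
}
class_centroids = {label: centroid(vs) for label, vs in labelled.items()}

label = classify([0.8, 0.3], class_centroids)
dupes = near_duplicates([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

The pairwise loop in `near_duplicates` is quadratic, so at scale you would ask the vector index for each item's nearest neighbours instead of comparing everything with everything.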

The lesson is that an embedding model plus a vector index is a general-purpose similarity engine, not a search-only tool. Teams that internalise this tend to add features cheaply, because each new "find similar" capability reuses the same embeddings and the same index rather than standing up a new system.

Common pitfalls

A handful of mistakes account for most vector-search disappointment, and all of them are easy to avoid once named. Model mismatch is the classic: embedding your documents with one model and your queries with another produces coordinates in incompatible spaces, so results look random — always use the same model on both sides. Forgetting to normalise when your distance metric assumes unit-length vectors silently degrades ranking; pick a metric and make sure your vectors match its assumption. Stale embeddings creep in when documents change but their vectors are not recomputed, so search points at an old version of the truth — treat re-embedding as part of your write path. Pure-vector tunnel vision means missing exact identifiers, error codes, and rare terms that keyword search would have caught, which is the case for hybrid search. And untested recall — shipping an ANN index without ever comparing it to brute-force ground truth — means you have no idea whether you are returning the right results at all. Measure recall once, and the index stops being a black box.

A worked example

Say you have ten million support articles and want "find the three most relevant to this question." Offline, you embed every article with your chosen model and load the vectors into an HNSW index — that is the slow, one-time build. At query time, you embed the incoming question with the same model (a few milliseconds), hand the vector to the index, and it walks its graph and returns the nearest few hundred candidates in single-digit milliseconds. You optionally run a cheap re-ranker over those candidates to sharpen the order, apply any tenant or language filter, and return the top three. If recall ever looks low, you raise ef_search and re-measure; if memory gets tight, you quantize. None of these steps requires the model to "understand" anything at query time — it is all distance arithmetic over precomputed vectors, which is exactly why it serves at scale.

The cost profile is worth internalising because it shapes design. The expensive, one-time work is embedding the corpus and building the index; the per-query work is one embedding call plus a fast index lookup. That asymmetry means adding documents is cheaper than you fear and querying is cheaper still, but a model change is a full rebuild. So the architecture that ages well keeps the embedding step on the write path (embed each document as it is created or updated), keeps the index warm in memory, and treats the choice of embedding model as a long-term commitment rather than a setting. Get those three right and vector search becomes one of the most reliable, lowest-drama parts of an AI system — a precomputed map you simply look points up on.
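Keeping the embedding step on the write path can be as simple as fingerprinting each document. A minimal sketch under stated assumptions: `EmbeddingStore` is a hypothetical in-memory class, `embed` is whatever function calls your model, and folding the model name into the fingerprint is one way to make a model change invalidate every stored vector.

```python
import hashlib


class EmbeddingStore:
    def __init__(self, embed, model_name):
        self.embed = embed            # hypothetical callable: text -> vector
        self.model_name = model_name
        self.rows = {}                # doc_id -> (fingerprint, vector)
        self.embed_calls = 0

    def _fingerprint(self, text):
        # The model name is part of the hash, so switching models
        # makes every existing row look stale.
        payload = f"{self.model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def upsert(self, doc_id, text):
        """Embed on write, but only when the content actually changed."""
        fingerprint = self._fingerprint(text)
        current = self.rows.get(doc_id)
        if current is not None and current[0] == fingerprint:
            return False              # unchanged: keep the stored vector
        self.rows[doc_id] = (fingerprint, self.embed(text))
        self.embed_calls += 1
        return True


# Toy embed function standing in for a real model call.
store = EmbeddingStore(lambda text: [float(len(text))], model_name="toy-v1")
first = store.upsert("a", "hello")
repeat = store.upsert("a", "hello")
edited = store.upsert("a", "hello!")
```

Unchanged documents cost nothing on rewrite, edited ones are re-embedded immediately, and the stored vectors never drift out of step with the text.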
