Retrieval-augmented generation
A model only knows what it saw during training, and it states guesses with the same confidence as facts. RAG works around both problems by turning a closed-book exam into an open-book one: before the model answers, the system fetches the relevant pages and pastes them into the prompt. The idea fits in a sentence. The engineering — what to fetch, how to cut it up, how to rank it, and how to tell why an answer went wrong — is a search problem stapled to a language model, and the search half is where most of the failures live.
Why bolt search onto a language model
Three problems push teams toward RAG, and it helps to name them because they have different fixes. First, staleness: the model's knowledge stops at its training cutoff, and retraining to learn yesterday's pricing change is absurd. Second, private data: the model has never seen your internal wiki, your contracts, or your codebase, and no amount of prompting conjures knowledge it does not have. Third, hallucination: as the page on how LLMs work covers, the model is a next-token predictor, so when it lacks the fact it produces the most plausible-sounding token sequence anyway. Retrieval attacks all three at once: fetch current, private, relevant text and the model can ground its answer in something real, ideally with a citation a user can check.
It is worth saying what RAG does not do. It does not make the model smarter, and it does not make hallucination impossible — a model can still ignore the retrieved text, misread it, or answer beyond it. What retrieval changes is where the facts come from and whether you can audit them. That reframing matters for debugging later: a RAG system is a search engine and a summariser in series, and either stage can be the one that failed.
Two pipelines, not one
Every RAG system splits into an offline half and an online half, and keeping them separate in your head is the single most useful mental model on this page. Ingestion runs ahead of time: collect documents, split them into chunks, embed each chunk into a vector, and store vectors plus text plus metadata in an index. Retrieval runs per query: embed the question, search the index for the nearest chunks, optionally re-rank them, assemble a prompt, and generate. They run at different rates, fail in different ways, and are owned by different code paths — ingestion bugs corrupt the index slowly and silently, retrieval bugs show up on the next query.
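The split can be made concrete with a deliberately tiny sketch. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the index is a plain list rather than a vector database, and the paragraph split is the most naive chunker possible. The point is only the shape: `ingest` runs ahead of time, `retrieve` runs per query, and the two share nothing except the index and the embedding function.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(docs: dict[str, str]) -> list[tuple[Counter, str, dict]]:
    """Offline half: split each document, embed each chunk, store vector + text + metadata."""
    index = []
    for doc_id, text in docs.items():
        for chunk in text.split("\n\n"):  # naive paragraph split, for illustration only
            chunk = chunk.strip()
            if chunk:
                index.append((toy_embed(chunk), chunk, {"doc": doc_id}))
    return index

def retrieve(index, question: str, k: int = 2) -> list[tuple[str, dict]]:
    """Online half: embed the question and return the k nearest chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[0]), reverse=True)
    return [(text, meta) for _, text, meta in ranked[:k]]
```

Note that both halves call the same `toy_embed`; swapping it in one half but not the other is the simplest way to break the system silently.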
The asymmetry has practical consequences. Ingestion decisions are expensive to revisit — changing your chunking scheme or embedding model means re-processing the whole corpus — while retrieval decisions can change per deploy or even per query. So teams iterate fast on the online half and discover too late that the ceiling on quality was set offline. If retrieved chunks keep looking subtly wrong, suspect ingestion first.
Chunking: the decision that haunts you
Documents are too long to embed whole — a single vector for a fifty-page contract averages away everything specific in it — so ingestion splits them into chunks. Chunk size is a genuine trade-off with no right answer, only a fit to your corpus and queries. Small chunks (a sentence or two) embed crisply, so retrieval is precise, but each one carries so little context that the model may receive a fact stripped of the conditions around it: a refund clause without the eligibility paragraph above it. Large chunks (a page or more) keep context intact but embed mushily — one vector now represents several topics, so the chunk matches queries it only partially answers and drags irrelevant text into the prompt.
| Choice | Retrieval precision | Context for the model | Typical failure |
|---|---|---|---|
| Small chunks (~1–3 sentences) | High — vectors are specific | Poor — facts arrive stripped of caveats | Right fact, wrong conclusion |
| Medium chunks (~a paragraph to ~500 tokens) | Good | Good | Few sharp failures, which is why it is the common default |
| Large chunks (a page+) | Low — vectors average topics | Rich but noisy | Relevant chunk ranks below vaguer ones |
Beyond size, structure matters more than people expect. Splitting on fixed character counts cuts sentences and tables in half; splitting on document structure — headings, paragraphs, list items, code blocks — keeps units of meaning intact and is almost always worth the parsing effort. A modest overlap between adjacent chunks (commonly 10–20%) insures against a fact straddling a boundary. Two refinements pay for themselves on most corpora: prepend each chunk's breadcrumb (document title and section heading) to its text before embedding, so "the limit is 100 requests" becomes findable as a rate-limiting fact; and store more context than you embed — retrieve by the small, precise chunk but hand the model the surrounding section. Decoupling the retrieval unit from the generation unit removes most of the size trade-off.
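A minimal sketch of the two refinements above, under stated assumptions: the input is a hypothetical plain-text document whose sections start with a `# ` heading line, paragraphs are separated by blank lines, and the `Chunk` fields are names invented for this example. Each paragraph becomes the retrieval unit with its breadcrumb prepended, while the whole section is stored as the generation unit. Overlap between chunks is left out to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    embed_text: str  # small unit that gets embedded: breadcrumb + paragraph
    context: str     # larger unit handed to the model: the whole section
    section: str

def chunk_by_structure(title: str, body: str) -> list[Chunk]:
    """Split on headings, then paragraphs; attach breadcrumb and parent section."""
    sections: list[tuple[str, list[str]]] = []
    heading, lines = "Introduction", []  # text before the first heading
    for line in body.splitlines():
        if line.startswith("# "):
            if lines:
                sections.append((heading, lines))
            heading, lines = line[2:].strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((heading, lines))

    chunks = []
    for heading, section_lines in sections:
        section_text = "\n".join(section_lines).strip()
        if not section_text:
            continue
        for paragraph in section_text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                breadcrumb = f"{title} > {heading}"
                chunks.append(Chunk(f"{breadcrumb}\n{paragraph}", section_text, heading))
    return chunks
```

With this shape, the sentence "The limit is 100 requests." is embedded together with "API guide > Rate limiting", and a hit on it returns the full rate-limiting section to the model.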
Embedding and indexing the chunks
Each chunk goes through an embedding model and comes out as a vector: a point in a space where nearby means semantically similar. The mechanics are covered on the embeddings & vector search page; what matters here is the contract. Query and chunks must be embedded by the same model (vectors from different models live in unrelated spaces), and swapping the embedding model means re-embedding everything. The vectors land in an approximate nearest neighbour (ANN) index, such as HNSW or IVF, in a dedicated vector database or a pgvector column, alongside the chunk text and its metadata.
Metadata is the unglamorous half of the index and a frequent root cause of "RAG is broken" tickets. Source document, section, timestamp, tenant, access permissions: storing these lets retrieval filter before or alongside the vector search, so "what changed in the March release" searches only March documents instead of hoping cosine similarity notices dates (it will not). Permission filtering deserves special paranoia — a vector index happily returns the best-matching chunk regardless of who is asking, so access control has to be enforced as a filter in the retrieval query, not assumed. Leaking one tenant's contract into another tenant's answer is the RAG version of a cross-tenant data breach, and it has happened.
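To illustrate what "enforced as a filter in the retrieval query" means, here is a small sketch. The `StoredChunk` fields and the precomputed `score` are assumptions made for the example; in a real system the score would come from the vector search and the filter would be pushed down into the index query rather than applied in Python. The essential property is that ineligible chunks are removed before ranking, so they can never reach the prompt however well they match.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredChunk:
    text: str
    tenant: str
    allowed_groups: frozenset
    score: float  # stand-in for the similarity a vector search would compute

def search_with_acl(chunks, tenant: str, user_groups: set, k: int = 3):
    """Filter by tenant and group membership first, then rank what is left."""
    visible = [
        c for c in chunks
        if c.tenant == tenant and c.allowed_groups & user_groups
    ]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]
```

A test worth keeping alongside code like this is the negative case: a user with no matching groups, or from another tenant, should get an empty list even when a near-perfect match exists in the index.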
Hybrid search: dense and sparse together
Embedding similarity — dense retrieval — is good at meaning and bad at strings. Ask for "error E-1047" or "the Hadfield account" and the embedding may not place that exact token anywhere useful, because identifiers, product codes, names, and rare jargon are precisely the things a semantic space smooths over. Classical lexical search — BM25, the term-frequency-based scoring that powered search engines for decades — has the opposite profile: it nails exact terms and misses paraphrases entirely. A query about "letting staff go" never lexically matches a policy titled "involuntary termination."
Hybrid search runs both and merges the results, and it is the production default because real query streams contain both kinds of queries. The usual merge is reciprocal rank fusion (RRF): score each document by the sum of 1/(k + rank) across the result lists it appears in, with k a smoothing constant (60 is the conventional choice). RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities are on incomparable scales. The chunks both retrievers agree on float to the top; the chunks only one finds still make the list.
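The fusion rule described above is short enough to write out in full. This is a sketch that assumes each retriever returns a list of document IDs ordered best first; the function name and the tie-break on ID are choices made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked ID lists: each appearance adds 1 / (k + rank), ranks start at 1."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["a", "b", "c"]   # hypothetical embedding-search results
sparse = ["b", "d", "a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
# fused == ["b", "a", "d", "c"]: both retrievers found a and b, so they lead.
```

Only ranks enter the computation, so the raw BM25 and cosine scores never need to be normalised against each other.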
Re-ranking: a second, more careful look
First-stage retrieval is built for speed: the query and every chunk were embedded independently, and the index just measures distance between vectors that have never seen each other. That independence is what makes searching millions of chunks cheap, and it is also why the ranking is approximate — the embedding of a chunk cannot know what question it will be compared against. A re-ranker fixes this with a second stage: take the top 50–100 candidates, run each one together with the query through a cross-encoder model that reads both texts jointly, and re-sort by its relevance score. Reading query and chunk together captures interactions a distance between two independent vectors cannot — which entity the question is actually about, whether the chunk answers it or merely mentions the same words.
The cost structure explains the two-stage shape. A cross-encoder is far too slow to run against the whole corpus — it is a full model forward pass per query-chunk pair — but perfectly affordable for a hundred candidates. So the cheap stage casts a wide net and the expensive stage sorts the catch. In practice re-ranking is one of the highest-value upgrades in the stack: it lets you be generous with first-stage recall (fetch 100 candidates instead of agonising over the top 5) and still hand the model only the handful that survive scrutiny. The price is added latency — tens to a couple of hundred milliseconds — which sits directly on the user's critical path, so latency-sensitive products tune candidate counts with care.
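The two-stage shape can be sketched with the expensive scorer left pluggable. Here `score_pair` is a hypothetical callable standing in for a cross-encoder, and `toy_pair_score` is a crude word-overlap substitute used only so the example runs without a model; neither is a real library API.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Second stage: score every (query, candidate) pair jointly, keep the best few."""
    scored = [(score_pair(query, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep first-stage order
    return [candidate for _, candidate in scored[:keep]]

def toy_pair_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words present in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0
```

The cost is one `score_pair` call per candidate, which is why the candidate count, not the corpus size, is the knob that controls re-ranking latency.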
The query side has its own small bag of upgrades. User questions are often terrible search queries — "it still doesn't work after that" retrieves nothing useful without the conversation behind it — so production systems commonly have the model rewrite the query first: resolve pronouns from chat history, expand acronyms, or split a compound question into several sub-queries retrieved independently. Each rewrite is one more model call of latency, so teams add them when the eval numbers say the naive query is the bottleneck, not before.
Assembling the prompt
Retrieval ends with a ranked list of chunks; generation begins with a prompt. The assembly step between them looks trivial and is not. The context window is a budget: retrieved chunks compete with the system prompt, conversation history, and room for the answer. As the inference & serving page explains, every token of context costs prefill time and KV-cache memory, so "stuff in everything that might help" has a real bill attached. More context is also not monotonically better for quality: models tend to attend most reliably to the beginning and end of a long context, and material buried in the middle gets measurably less attention (the "lost in the middle" effect reported by Liu et al., 2023). Feeding the model thirty mediocre chunks routinely produces worse answers than feeding it the five best.
Standard assembly practice: keep the strongest chunks, place them near the start or end of the context, label each with its source (document, section, date) so the model can cite, and instruct the model to answer only from the provided material and to say so when the material does not contain the answer. That last instruction is your honesty valve. A RAG system that can say "the documents don't cover this" is dramatically more trustworthy than one that always produces something, and models follow the instruction imperfectly — which is exactly what the faithfulness evaluation below is for.
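One way to sketch that practice, with several simplifying assumptions: chunks arrive as dicts with hypothetical `text` and `source` keys, already ranked best first; token cost is approximated by word count rather than a real tokenizer; and the instruction wording is only an example. The ordering step alternates chunks between the front and the back so that the strongest sit at the edges of the context and the weakest land in the middle.

```python
def assemble_prompt(question: str, ranked_chunks: list[dict], budget_tokens: int = 200) -> str:
    """Keep the best chunks that fit the budget, edge-load them, label sources."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk["text"].split())  # crude proxy for a token count
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost

    # Ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back in reverse,
    # so rank 1 opens the context and rank 2 closes it.
    front = [c for i, c in enumerate(kept) if i % 2 == 0]
    back = [c for i, c in enumerate(kept) if i % 2 == 1]
    ordered = front + back[::-1]

    blocks = [
        f"[{n}] (source: {c['source']})\n{c['text']}"
        for n, c in enumerate(ordered, start=1)
    ]
    instructions = (
        "Answer only from the sources below and cite them by number. "
        "If they do not contain the answer, say that the documents do not cover it."
    )
    return instructions + "\n\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
```

Stopping at the first chunk that does not fit is a design choice that preserves rank order; skipping it and trying smaller, lower-ranked chunks would use the budget more fully at the price of admitting weaker material.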
Evaluation: score the halves separately
The cardinal rule of RAG evaluation is to measure retrieval and generation separately, because an end-to-end "was the answer good" number cannot tell you which half to fix. Retrieval is evaluated like the search problem it is. Build a golden set of real queries, each labelled with the chunks (or source passages) that actually answer it, and compute recall@k — the fraction of queries where a correct chunk appears in the top k — plus a rank-aware metric like MRR or nDCG if position matters to you. Recall@k is the number to watch first: if the right chunk is not in what you hand the model, nothing downstream can save the answer.
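Both retrieval metrics are a few lines each. The sketch below follows the definition used in this section (a query counts as a hit if any correct chunk appears in the top k) and assumes two hypothetical dicts keyed by query: `results` maps to the ranked chunk IDs the system returned, `relevant` maps to the set of chunk IDs labelled as correct.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one correct chunk in the top k."""
    hits = sum(1 for query, ranked in results.items() if set(ranked[:k]) & relevant[query])
    return hits / len(results)

def mean_reciprocal_rank(results: dict[str, list[str]], relevant: dict[str, set[str]]) -> float:
    """Average of 1 / rank of the first correct chunk; a miss contributes 0."""
    total = 0.0
    for query, ranked in results.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(results)

results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c5", "c6"], "q3": ["c7", "c2", "c8"]}
relevant = {"q1": {"c1"}, "q2": {"c0"}, "q3": {"c7"}}
# recall_at_k(results, relevant, 2) is 2/3: q1 and q3 hit, q2 misses.
# mean_reciprocal_rank(results, relevant) is (1/2 + 0 + 1) / 3 = 0.5.
```

Comparing recall at a small k with recall at a large k is a quick diagnostic: a big gap suggests the right chunks are being found but ranked too low, which points at re-ranking rather than at ingestion.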
Generation is evaluated given the retrieved context, and the property that matters most is faithfulness (also called groundedness): is every claim in the answer supported by the supplied chunks? A separate axis, answer relevance, asks whether the response actually addresses the question rather than summarising the context at it. Faithfulness is usually scored by an LLM judge — decompose the answer into claims, check each against the context — which works well enough to be standard but inherits the judge's blind spots, so calibrate it against a sample of human labels before trusting trend lines, and re-calibrate when you change the judge model.
| Stage | Question it answers | Core metrics |
|---|---|---|
| Retrieval | Did the right chunks reach the prompt? | recall@k, MRR / nDCG |
| Generation | Is the answer supported by those chunks? | faithfulness, answer relevance |
| End to end | Did the user get a correct answer? | human/judge-graded correctness on a golden set |
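The faithfulness score described above reduces to a ratio once the answer has been split into claims. In this sketch the claim decomposition is assumed to have happened already, `is_supported` is a hypothetical callable standing in for the LLM judge, and `toy_judge` is a substring check used only to make the example runnable. Treating an answer with no claims as fully faithful is a convention chosen here, not a standard.

```python
from typing import Callable

def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Share of claims the judge finds supported by the supplied context."""
    if not claims:
        return 1.0  # assumed convention: nothing asserted, nothing unsupported
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: exact (case-insensitive) substring match."""
    return claim.lower() in context.lower()
```

Keeping the judge behind a plain callable makes it straightforward to run the same claims past human labels and measure how often the two agree.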
The golden set is the asset worth maintaining. Fifty to a few hundred real queries with labelled answers, refreshed as the corpus changes, lets you run every chunking, embedding, and re-ranking experiment against the same yardstick — without it, tuning RAG is folklore. Wire the eval into CI for the ingestion pipeline too: an innocuous parser change that starts mangling tables will show up as a recall drop long before users report it.
Debugging: retrieval failure or generation failure?
A user reports a wrong answer. The single most useful habit in operating a RAG system is to look at the retrieved chunks before doing anything else, because the fix differs entirely depending on which half failed. If the correct information never made it into the prompt, you have a retrieval failure, and no amount of prompt engineering will fix it — the model cannot cite what it never saw. If the correct information was in the prompt and the answer still contradicts or ignores it, you have a generation failure, and re-tuning your chunking will waste a week.
| Symptom | Likely half | Where to look |
|---|---|---|
| Answer cites nothing relevant; chunks are off-topic | Retrieval | Query embedding, hybrid weighting, chunking, metadata filters |
| Right document, wrong section retrieved | Retrieval (ingestion) | Chunk boundaries, missing heading context, chunk size |
| Right chunks retrieved, answer contradicts them | Generation | Prompt instructions, context placement, model choice |
| Answer is correct but unsupported by chunks | Generation | Model answering from parametric memory — fine until it isn't; tighten grounding instructions |
| Answer blends two entities or versions | Either | Conflicting chunks retrieved (dedupe, recency filter) or model merging them (prompt) |
| "The documents don't cover this" when they do | Retrieval, usually | recall@k on that query; vocabulary mismatch between query and corpus |
This triage only works if you can see the chunks, which makes logging non-negotiable: for every request, persist the query, the rewritten query if any, the retrieved chunks with their scores, and the final prompt. The day a customer escalation arrives, replaying that trace answers "which half" in minutes. It also feeds the golden set — every confirmed failure is a labelled eval case you did not have yesterday.
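A trace record and the first triage question might look like the following sketch. The `Trace` fields mirror the list above plus the final answer; the field names, the JSON-lines format, and the `which_half` helper are assumptions for illustration. The helper is meant for a confirmed wrong answer where someone has already identified which chunk IDs contain the correct information.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Trace:
    query: str
    rewritten_query: Optional[str]
    retrieved: list  # dicts such as {"chunk_id": ..., "score": ..., "text": ...}
    final_prompt: str
    answer: str

def to_log_line(trace: Trace) -> str:
    """Serialise one request as a single JSON line for an append-only log."""
    return json.dumps(asdict(trace), ensure_ascii=False)

def which_half(trace: Trace, answer_chunk_ids: set) -> str:
    """For a known-bad answer: did the correct chunk reach the prompt at all?"""
    retrieved_ids = {chunk["chunk_id"] for chunk in trace.retrieved}
    return "generation" if retrieved_ids & answer_chunk_ids else "retrieval"
```

The same record, once a human has confirmed the correct chunk, is already in the shape the golden set needs: a query plus the IDs that should have been retrieved.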
Keeping the index honest
In production, the ingestion half becomes a data pipeline with all the usual obligations. Documents change, so you need re-indexing — either scheduled re-crawls or event-driven updates — and a deletion path that actually removes withdrawn documents from the index, not just the source system; a chunk that outlives its document is a stale-fact generator with a straight face. Embedding-model upgrades are a migration: old and new vectors are incompatible, so you re-embed the corpus, usually into a parallel index, validate recall on the golden set, then cut over. And the permission filters mentioned earlier need tests of their own, because nothing about a vector index enforces them for you.
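Two of these obligations reduce to small checks that are easy to automate. The sketch below assumes index rows are dicts with hypothetical `chunk_id` and `doc_id` keys and that recall has already been measured on the golden set for both indexes; the function names and the `max_drop` tolerance are invented for the example.

```python
def find_orphaned_chunks(index_rows: list[dict], live_doc_ids: set) -> list:
    """Chunk IDs whose source document no longer exists and so should be deleted."""
    return [row["chunk_id"] for row in index_rows if row["doc_id"] not in live_doc_ids]

def ready_to_cut_over(old_recall: float, new_recall: float, max_drop: float = 0.0) -> bool:
    """Gate an embedding-model migration on golden-set recall for the parallel index."""
    return new_recall >= old_recall - max_drop
```

Running the orphan check on a schedule catches deletions that the event-driven path missed, and putting the cut-over gate in the migration script keeps the comparison from depending on someone remembering to look.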
It is also worth revisiting whether RAG is the right tool as models change. Long-context models can swallow whole document sets, which can replace retrieval for small, stable corpora, at a per-query token cost that grows with the corpus and with a quality risk from the lost-in-the-middle effect. Fine-tuning teaches a model style, format, and skills, but is a poor vehicle for facts that change. The honest summary: for large, dynamic, or permissioned corpora, retrieval remains the practical answer, and the techniques on this page are about making the search half, which was always the hard half, actually good.
Putting it together
RAG looks like an AI feature and behaves like a search product. The model at the end is the most replaceable part of the stack; the durable engineering is everything upstream — structure-aware chunking with context attached, hybrid retrieval because neither dense nor sparse search is sufficient alone, a re-ranking stage that buys precision back from a generous first pass, and prompt assembly that respects both the token budget and the model's uneven attention. Around it all, the operational loop: a golden set, recall and faithfulness measured separately, full traces of what was retrieved, and the discipline of asking "which half failed?" before touching anything.
Build the skeleton simply — medium chunks, hybrid search, a re-ranker, top-5 into the prompt — and spend your effort on evaluation and traces from day one. Teams that can measure retrieval quality improve steadily; teams that can only eyeball final answers thrash. The next page takes the same model and hands it tools instead of documents: agents & tool use, where retrieval becomes just one of the actions a model can decide to take.
Further reading
- Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the paper that named the pattern.
- Karpukhin et al. (2020) — Dense Passage Retrieval — the case for learned dense retrieval over pure lexical search.
- Liu et al. (2023) — Lost in the Middle: How Language Models Use Long Contexts — why chunk placement in the prompt matters.
- Semicolony — Embeddings & vector search — the index underneath the retrieval half: cosine distance, HNSW, IVF.
- Semicolony — pgvector vs Pinecone vs Weaviate — choosing where the vectors live.