The AI stack, from the engineer's seat.
Not how to train a model. How the model you call runs in production: how a prompt becomes tokens, how embeddings turn meaning into coordinates, why serving is a memory problem, and how retrieval and agents bolt real systems onto a next-token predictor. Same level the rest of the codex works at — what is the system actually doing, and where do the costs hide.
Three sub-pages are live, with two more in flight. Each links to its plain-English ELI5 front door and the matching simulator where one exists.
Start here.
How LLMs work
A language model is a next-token predictor wrapped in a loop. Tokenization, embeddings, the transformer block, attention, and autoregressive decoding — the whole path from your prompt to one word at a time, with no maths you do not need.
Embeddings & vector search
Turn text into coordinates so "find similar" becomes "find nearby". What an embedding is, why cosine distance works, and how approximate nearest-neighbour indexes (HNSW, IVF) make search over a billion vectors fast enough to serve.
Inference & serving
Why serving an LLM is a memory problem, not a compute one. The KV cache, prefill vs decode, continuous batching, PagedAttention, and why throughput and latency pull in opposite directions on the same GPU.
Two more, in flight.
The retrieval and agent layers — the two patterns most teams are actually shipping in 2026. In the order they make sense to learn:
- 04Retrieval-augmented generationGive the model an open-book exam. The ingestion and retrieval halves, chunking trade-offs, hybrid search, re-ranking, and how to tell whether a wrong answer came from retrieval or from generation.chunking · hybrid search · re-ranking · evaluation · hallucination
- 05Agents & tool useWhat turns a chat model into something that takes actions. Tool calling, the plan-act-observe loop, memory, MCP, and the guardrails that keep an autonomous loop from doing real damage.tool calling · ReAct · memory · MCP · guardrails