How LLMs work

Strip away the mystique and a large language model is one machine doing one job: given the text so far, guess the next token. Everything else (the chat, the apparent reasoning, the code it writes) is that single guess, run in a loop, on a network that has read a large fraction of the written internet. This page follows the whole path from your prompt to the reply, produced one token at a time: tokenization, embeddings, attention, the transformer block, decoding, and the training that shaped it. No more maths than you need.

The one job: predict the next token

A language model never sees a sentence as a sentence. It sees a sequence of tokens and produces a probability for every possible next token. That is the entire model. When you chat with one, the system feeds in everything said so far, the model returns a probability distribution over its whole vocabulary, the program picks one token from that distribution and appends it, and the whole thing runs again with the new, one-token-longer input. Generation is that loop, and nothing more.
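The loop described above can be sketched in a few lines. This is a minimal illustration, not how any real system is implemented: `TOY_TABLE` and `toy_next_token_probs` are invented stand-ins for the network, and unlike a real model the stand-in looks only at the last token rather than the whole sequence.

```python
import random

# Hypothetical stand-in for a real network. The probabilities are invented
# for illustration; a real model computes them from billions of weights.
TOY_TABLE = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    """Return {token: probability} for the next position.

    Simplification: only the last token is consulted. A real model
    conditions on every token in the sequence.
    """
    return TOY_TABLE[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)        # 1. predict a distribution
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]  # 2. pick one token
        if choice == "<eos>":                       # a stop token ends the loop
            break
        tokens.append(choice)                       # 3. append and go again
    return tokens
```

Calling `generate(["the"])` returns the prompt plus two sampled tokens; the only state carried from one step to the next is the growing `tokens` list.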

This reframing explains a surprising amount. The model has no separate "memory" or "thinking" step that lives outside the text; its only lever on what happens next is the sequence it has already produced. It is also why one model can chat, summarise, translate, extract data, and write code without being separately programmed for each task. All of those are just "what text plausibly comes next," and the training data contained millions of examples of every one of them. The model is a single function from "text so far" to "probability of each next token," and the apparent intelligence is what that function learned in order to make those predictions well.

Keep that frame in mind as we go, because every component below exists to make one prediction better: tokenization decides what the units of prediction are, embeddings give those units meaning, attention lets each unit gather context from the others, the transformer block does this over and over to build understanding, and decoding turns the final prediction back into a token you can read.

The whole pipeline at a glance

Before the detail, here is the shape of the entire forward pass. Text comes in on the left, one token comes out on the right, and the output token is fed back to the start to produce the next one.

One forward pass produces one token. The loop on the bottom is the entire generation process.

Two things are worth noticing already. First, almost all the work is in the middle box, the stack of transformer blocks, which is conceptually run from scratch for every token (real systems cache most of it, which is the subject of the serving page). Second, the only feedback path is the generated token itself. The model does not carry hidden scratch state between steps; the text on the page is its working memory.

Step 1 — tokenization

Models do not work on characters or words; they work on tokens, which are common chunks of text. A tokenizer breaks your input into pieces drawn from a fixed vocabulary, typically 30,000 to 200,000 entries. Frequent words become a single token; rare words get split into several. The word "the" is one token; "tokenization" might be three (token, iz, ation); an emoji, an unusual name, or a chunk of source code can be several. The most common scheme, byte-pair encoding, starts from raw bytes and greedily merges the most frequent adjacent pairs until it has built a vocabulary of the desired size, which is why it can represent literally any input while still giving common words their own single token.
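The greedy merging idea can be sketched as follows. This is a simplified illustration under stated assumptions: it works on characters rather than raw bytes, it trains on a tiny invented word-count table, and the function names are hypothetical rather than taken from any tokenizer library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges; return them and the re-segmented words."""
    # Start with every word split into single characters (real BPE uses bytes).
    words = {tuple(word): count for word, count in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Invented toy corpus: word -> how often it appears.
merges, words = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings in "lower" and "lowest" are still split into characters, which mirrors the frequent-word versus rare-word behaviour described above.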

This detail leaks into everything you do with a model. Pricing and rate limits are quoted in tokens, and a model's context window — the maximum amount of text it can consider at once — is a token count, not a word count. A useful rule of thumb for English is roughly four characters, or about 0.75 words, per token, so 1,000 tokens is about 750 words. Other languages can be far less efficient: text in scripts that were rare in the training data may use several tokens per character, which makes the same sentence cost more and consume more of the window.
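The rule of thumb translates into a quick back-of-envelope estimator. This is only a rough sketch for English prose; the function names are made up for illustration, and the only exact count comes from running the model's actual tokenizer.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate for English text, using about four characters per token."""
    return round(len(text) / chars_per_token)

def words_for_tokens(tokens, words_per_token=0.75):
    """Rough word count that fits in a given token budget."""
    return tokens * words_per_token
```

For code or for languages that tokenize less efficiently, the `chars_per_token` default will overestimate how much text fits, so it is worth lowering it or measuring directly.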

Tokenization also quietly shapes behaviour. The reason older models struggled to spell words backwards, count the letters in a word, or do precise character-level edits is that they never saw the letters — only the chunks. A model that has only ever seen "strawberry" as two or three tokens has no direct view of its individual r's. Many "the model is bad at this simple thing" complaints trace back to the tokenizer rather than the network.

Practical consequence. If you are counting cost or fitting content into a window, count tokens, not words, and remember code and non-English text are token-heavy. The same prompt can be 20–40% more tokens than its word count suggests.

Step 2 — embeddings turn tokens into meaning

Each token id is looked up in a large table and replaced by a vector — a list of numbers, often a few thousand of them. This embedding is where meaning lives. The model learns these vectors during training so that tokens used in similar ways land near each other in the space, and so that directions in the space line up with real relationships. The textbook example is that the vector arithmetic "king − man + woman" lands near "queen": the model has, without being told to, organised the space so that a consistent direction encodes something like "royalty" and another encodes "gender."

Every token becomes a learned vector; the model arranges the space so related tokens sit close together.
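The vector arithmetic can be tried on a toy table. The three-dimensional vectors below are hand-built so that the analogy works exactly; they are an assumption for illustration only. Learned embeddings have thousands of dimensions, and in them the analogy holds only approximately.

```python
import math

# Hand-made vectors, NOT learned ones. Dimensions are roughly
# (royalty, gender, is-a-person), chosen so the example is easy to follow.
TOY_EMBEDDINGS = {
    "king":  [1.0,  1.0,  1.0],
    "queen": [1.0, -1.0,  1.0],
    "man":   [0.0,  1.0,  1.0],
    "woman": [0.0, -1.0,  1.0],
    "apple": [0.0,  0.0, -1.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(vector, exclude=()):
    """Return the vocabulary word whose vector points most nearly the same way."""
    candidates = [w for w in TOY_EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, TOY_EMBEDDINGS[w]))

def analogy(a, b, c):
    """Compute a - b + c and return the nearest word other than the inputs."""
    va, vb, vc = (TOY_EMBEDDINGS[w] for w in (a, b, c))
    target = [x - y + z for x, y, z in zip(va, vb, vc)]
    return nearest(target, exclude={a, b, c})
```

With this table, `analogy("king", "man", "woman")` returns `"queen"`. Excluding the input words from the search is the usual convention for this kind of demonstration.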

One more ingredient goes in here. A pure bag of token vectors has no notion of order, yet "dog bites man" and "man bites dog" share every token and mean opposite things. So a positional encoding is added to each token's vector to mark where it sits in the sequence. Modern models often use rotary position embeddings, which encode position as a rotation of the vector and generalise more gracefully to longer inputs than the fixed tables earlier models used. After this step the prompt has become a grid of numbers: one vector per token, each carrying both meaning and position. Everything downstream is arithmetic on that grid.
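The rotation idea can be sketched as below. This is a minimal illustration assuming an even-length vector and the commonly used frequency base of 10000; real implementations differ in which dimensions they pair up, and they apply the rotation to queries and keys inside attention rather than to the raw embedding.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to position.

    Earlier pairs rotate quickly, later pairs slowly, so each position
    gets a distinct pattern of angles. Assumes len(vec) is even.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

The useful property is that the dot product of two rotated vectors depends only on how far apart their positions are: `dot(rotate(q, 5), rotate(k, 3))` matches `dot(rotate(q, 12), rotate(k, 10))` up to floating-point error, because both pairs are two positions apart.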

Step 3 — attention, the part that matters

The transformer is built from a stack of identical blocks, and the heart of each block is attention. Attention lets every token look at every other token and pull in the ones that are relevant to it. When the model processes "it" in "the trophy did not fit in the suitcase because it was too big," attention is the mechanism that lets "it" gather information from "trophy" rather than "suitcase" — a disambiguation that depends entirely on context, which is exactly what attention supplies.

Mechanically, each token produces three vectors by multiplying its current vector (the embedding, in the first block) by three learned matrices: a query (what am I looking for?), a key (what do I offer?), and a value (what I will hand over if chosen). To update a token, the model compares that token's query against every token's key, using a dot product, to get a relevance score for each, runs those scores through a softmax so they become weights that sum to one, and then takes a weighted blend of all the values. A token that is highly relevant contributes most of its value; an irrelevant one contributes almost nothing.

Self-attention: each token blends in the values of the tokens it finds most relevant. Thicker line, higher weight.
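The query, key, and value steps can be written out for a single head in plain Python. This is a sketch for reading, not for speed: the weight matrices are passed in by the caller, and the division by the square root of the key dimension is the standard scaling used in practice, which the description above leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(xs, w_q, w_k, w_v):
    """Single-head, unmasked self-attention over a list of token vectors."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to this one: query dotted with each key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted blend of all value vectors.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of `all_weights` corresponds to the line thicknesses in the figure: one weight per token, summing to one.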

Two refinements make this work at scale. First, attention is done with several heads in parallel, each with its own query/key/value matrices, so different heads can specialise: one might track subject-verb agreement, another might follow quotation marks, another might link pronouns to their referents. Their outputs are concatenated and mixed. Second, the model uses causal masking, in training as well as in generation: a token may attend only to itself and the tokens before it, never to those after, because at generation time the later tokens do not exist yet. This left-to-right constraint is why the model builds its answer one step at a time.
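Causal masking amounts to hiding the upper-right half of the score grid before the softmax. The sketch below takes a square grid of raw scores (row i holds token i's scores against every token) and is an illustration only; real implementations usually set the hidden scores to negative infinity instead of slicing, which gives the same weights.

```python
import math

def causal_softmax(score_rows):
    """Softmax each row over positions 0..i only; later positions get weight 0."""
    weights = []
    for i, row in enumerate(score_rows):
        visible = row[: i + 1]                 # token i sees itself and earlier tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)    # future tokens contribute nothing
        weights.append([e / total for e in exps] + hidden)
    return weights
```

With three tokens and all scores equal, the rows come out as `[1, 0, 0]`, `[0.5, 0.5, 0]`, and three equal thirds: the first token can only attend to itself, and each later token spreads its attention over a longer prefix.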

Why attention is expensive. Comparing every token against every other token is quadratic in the sequence length. Double the prompt and you roughly quadruple the attention work, and you double the memory the model must keep around. That single fact drives most of the cost and engineering in serving, and is the reason long context is hard and not free.
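The two growth rates are easy to check with arithmetic. The helper names below are invented for this sketch, and the counts are per head and per layer, ignoring constant factors.

```python
def attention_score_count(seq_len, causal=True):
    """Query-key comparisons for one head in one layer.

    Unmasked attention compares every pair (n * n). With causal masking
    each token sees itself and earlier tokens (n * (n + 1) / 2), which is
    still quadratic.
    """
    if causal:
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len

def kv_entries(seq_len):
    """Key and value vectors kept per head per layer: one of each per token."""
    return 2 * seq_len
```

Going from 1,000 to 2,000 tokens multiplies the unmasked comparison count by exactly four and the causal count by very nearly four, while the number of stored keys and values only doubles.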

The transformer block in full

Attention is only half of a block. After the attention step mixes information across tokens, a small feed-forward network is applied to each token position independently — the same little two-layer network, run on every position — which is where much of the model's learned knowledge is stored. Around both the attention and the feed-forward steps sit two pieces of plumbing that make deep stacks trainable: a residual connection that adds the input back to the output (so information and gradients can skip the layer), and normalization that keeps the numbers in a stable range.

One block. The dashed residual lines carry the input around each step. A model stacks dozens of these.
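The wiring of one block can be sketched with the attention and feed-forward steps passed in as functions. Several things here are assumptions for illustration: normalization is applied before each step (one common ordering; the text above does not say which is used), the learned scale and shift of the normalization are omitted, and `mean_of_prefix` and `relu_ffn` are weightless stand-ins rather than real layers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_block(xs, mix_tokens, feed_forward):
    """One block: token mixing, then a per-position network, each with a residual."""
    # Attention step: sees all positions at once, so it takes the whole list.
    mixed = mix_tokens([layer_norm(x) for x in xs])
    xs = [add(x, m) for x, m in zip(xs, mixed)]          # residual: add input back
    # Feed-forward step: the same function applied to each position on its own.
    xs = [add(x, feed_forward(layer_norm(x))) for x in xs]  # residual again
    return xs

def mean_of_prefix(xs):
    """Stand-in for attention: each position averages itself and earlier positions."""
    out = []
    for i in range(len(xs)):
        prefix = xs[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def relu_ffn(x):
    """Stand-in for the feed-forward network: a bare ReLU with no weights."""
    return [max(0.0, v) for v in x]
```

Stacking is then just repeated application, for example `for _ in range(12): xs = transformer_block(xs, mean_of_prefix, relu_ffn)`. If both steps returned zeros, the block would pass its input through unchanged, which is the point of the residual path.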

That is one block. A real model stacks dozens of them — a small model might have 12, a frontier model over a hundred — each with its own learned weights. Early blocks tend to capture local, surface patterns; deeper blocks capture longer-range structure and more abstract relationships. The "depth" of a model is essentially how many times it gets to refine its representation of the sequence before producing an answer, and the parameter count people quote is dominated by the weights in all these attention and feed-forward layers.
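A rough parameter count follows from the shapes involved. This sketch assumes four square projection matrices in attention (query, key, value, output) and a feed-forward network whose hidden layer is four times the model width, which is a common but not universal choice; biases, normalization parameters, and any position table are ignored.

```python
def block_parameters(d_model, ffn_multiplier=4):
    """Approximate weights in one transformer block."""
    attention = 4 * d_model * d_model                       # Q, K, V, and output projections
    feed_forward = 2 * ffn_multiplier * d_model * d_model   # up and down projections
    return attention + feed_forward

def approx_parameters(n_layers, d_model, vocab_size, ffn_multiplier=4):
    """Blocks plus the token embedding table (assumed shared with the unembedding)."""
    return n_layers * block_parameters(d_model, ffn_multiplier) + vocab_size * d_model
```

For a 12-block model with width 768 and a 50,257-entry vocabulary, `approx_parameters(12, 768, 50257)` gives 123,532,032, of which about 85 million sit in the blocks. Those dimensions are borrowed from the published GPT-2 small configuration, whose full count is about 124 million once the ignored pieces are included.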

Step 4 — decoding, one token at a time

After the final block, the model takes the vector at the last position and multiplies it by an "unembedding" matrix to produce a raw score (a logit) for every token in the vocabulary. A softmax turns those scores into a probability distribution. Now the system has to choose a token, and the choice is its own small design space. Always taking the single most likely token (greedy decoding) tends to produce flat, repetitive text and can get stuck in loops, so most systems sample, shaped by a few knobs.

Knob	What it does	Turn it up and…
Temperature	Scales the logits before softmax, flattening or sharpening the distribution	output gets more random and varied; near 0 it is almost deterministic
Top-p (nucleus)	Samples only from the smallest set of tokens whose probabilities sum to p	more candidates stay in play, more diverse text
Top-k	Samples only from the k most likely tokens	a larger pool of candidates each step
Repetition penalty	Down-weights tokens already produced	fewer loops and less verbatim repetition
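The first three knobs in the table can be combined in one small sampler. This is a sketch under simple assumptions: logits arrive as a plain list indexed by token id, top-k is applied before top-p, and the repetition penalty is left out. Production samplers differ in ordering and details.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the chosen token."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always the top logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]                 # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:                        # keep the smallest set reaching p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    weights = [probs[i] for i in ranked]        # choices() renormalises these
    return rng.choices(ranked, weights=weights, k=1)[0]
```

Passing `rng=random.Random(0)` makes a run reproducible, which is handy when comparing settings side by side.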

The chosen token is appended to the sequence, and the entire forward pass runs again to produce the next one. Set temperature to zero and the model is nearly deterministic for a given prompt; raise it and the same prompt gives different answers each time. This is also the honest answer to "why does it hallucinate." The model is optimised for plausible next tokens, not true ones, and a fluent, confident, wrong answer can be more probable than an awkward correct one or an admission of ignorance. Hallucination is not a bug bolted on; it is the same next-token machinery doing exactly what it was trained to do when the truth is not strongly represented in its weights or its context.

Where the memory goes: context and the KV cache

Because attention recomputes relationships across the whole sequence, the model needs the keys and values of every previous token available at each step. Recomputing them from scratch every time would be hopelessly wasteful, so serving systems compute them once per token and keep them in a store called the KV cache. That cache is the model's running memory of the conversation, and on a busy server handling long contexts and many concurrent requests it can consume more memory than the model's own weights. The size of the context window is, in practice, a budget on how large that cache is allowed to grow. The mechanics (prefill versus decode, batching, and why all of this makes serving a memory problem rather than a compute one) are the subject of the inference and serving page.
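The size of the cache follows directly from the model's shape. The formula below is the standard accounting (one key and one value vector per token, per layer, per key/value head); the configuration in the usage note is hypothetical and not taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Memory needed to hold keys and values for `seq_len` tokens.

    bytes_per_value=2 assumes 16-bit storage.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * seq_len * batch_size
```

For a hypothetical model with 32 layers, 32 key/value heads, and a head dimension of 128, each token costs 512 KiB, so `kv_cache_bytes(32, 32, 128, 4096)` comes to exactly 2 GiB for a single 4,096-token sequence. Every additional concurrent sequence adds the same again, which is how the cache grows past the weights on a loaded server.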

A worked example, start to finish

Tie the pieces together with one concrete step. You send the prompt "The capital of France is". The tokenizer splits it into the tokens The, capital, of, France, is — five token ids. Each id is looked up in the embedding table and becomes a vector, and a positional signal is added so the model knows "France" came fourth and "is" came fifth. The result is a grid of five vectors.

That grid flows up through the stack of transformer blocks. In the lower blocks, attention links "capital" and "of" and "France" into a single notion of "the capital-of relationship applied to France"; the feed-forward layers, which are where factual associations live, surface the strong link between that pattern and "Paris." By the top of the stack, the vector sitting at the final position ("is") has absorbed everything needed to predict what comes next.

The unembedding step turns that final vector into a logit for every token in the vocabulary. "Paris" gets a very high score; "London," "the," and a few thousand others get lower ones; almost everything gets near zero. Softmax converts those scores to probabilities — perhaps 0.92 for "Paris" — and the sampler, at a low temperature, picks "Paris." That token is appended to the sequence, and the model runs the entire pass again on six tokens to decide whether the next one is a full stop, a comma, or more words. Multiply that loop by a few hundred and you have a paragraph. Everything you experience as the model "knowing" the capital of France is contained in that one high logit, produced by weights that were nudged toward it during training every time the pattern appeared in the data.
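The last step of the worked example, from logits to a chosen token, looks like this in code. The logit values are invented for illustration and cover only five tokens; a real vocabulary has tens of thousands of entries and the actual numbers depend on the model.

```python
import math

# Invented logits for the position after "The capital of France is".
logits = {"Paris": 9.0, "London": 5.5, "the": 5.0, "Lyon": 4.5, "banana": -2.0}

def softmax_dict(scores):
    """Convert {token: logit} into {token: probability}."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax_dict(logits)
# At a very low temperature the sampler behaves like picking the top token.
next_token = max(probs, key=probs.get)
```

A gap of a few points between the top logit and the rest is enough for "Paris" to take most of the probability mass, because softmax exponentiates the differences.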

Training, in two phases

Two phases produce the model you actually call. Pre-training shows the network enormous amounts of text and asks it, over and over, to predict the next token. Each time it is wrong, the error is propagated back through every layer and nudges billions of parameters a tiny amount in the direction that would have been less wrong. Repeat across trillions of tokens and the network becomes very good at next-token prediction, which forces it to internalise grammar, facts, styles, and a great deal of world structure as a side effect. The result is a model that completes text but does not necessarily answer questions or follow instructions — ask a raw base model a question and it might continue with more questions, because that is a plausible continuation.
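The "nudge in the direction that would have been less wrong" can be shown on a single prediction. This is a deliberately simplified sketch: it adjusts the logits directly, whereas real training propagates the same error signal back to the weights that produced them. The loss is the standard cross-entropy used for next-token prediction.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy: negative log of the probability given to the true next token."""
    return -math.log(softmax(logits)[target_index])

def logit_gradient(logits, target_index):
    """Gradient of the loss for each logit: predicted probability minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

def gradient_step(logits, target_index, learning_rate=0.5):
    """Move each logit a small amount against its gradient."""
    grad = logit_gradient(logits, target_index)
    return [l - learning_rate * g for l, g in zip(logits, grad)]
```

The gradient is negative for the correct token and positive for all the others, so one step raises the correct token's score, lowers the rest, and reduces the loss.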

A second, much smaller phase turns that base model into an assistant. Instruction tuning fine-tunes it on examples of prompts paired with good responses, teaching it the chat format and the habit of actually answering. Then preference tuning — often reinforcement learning from human feedback, or newer variants — shows the model pairs of responses ranked by people and adjusts it toward the preferred kind. The tone, the helpfulness, the refusals, and much of what feels like "personality" all come from this second phase shaping the same next-token machine. None of it changes the fundamental mechanism; it changes which continuations the model finds most probable.

Common misconceptions, cleared up

A few mental traps are worth naming directly, because they lead to real engineering mistakes. The model is not looking things up in a database when it answers; everything it "knows" is baked into its weights as statistical structure, which is why it can be confidently wrong and why retrieval-augmented generation exists to feed it facts at query time. It does not plan ahead in any explicit sense; it commits to one token before considering the next, which is why prompting it to "think step by step" helps — it literally puts useful intermediate tokens into the context before the answer is due. It has no persistent memory between separate conversations unless your application supplies one. And it does not have a separate "reasoning module": chain-of-thought and reasoning models work by spending more tokens, not by switching to a different kind of computation.

What this buys you as an engineer

Holding the right mental model makes these systems predictable instead of magical. The context window is finite and attention is quadratic, so prompts have a real, measurable cost and very long inputs get slow and expensive; trim context rather than dumping everything in. The model only knows what is in its weights or in the prompt, so when it needs current or private facts you must put them in the prompt, which is the entire premise of retrieval. Output is sampled, so identical requests can differ unless you pin temperature to zero, and even then upstream changes can shift results. Because generation is left-to-right with no take-backs, techniques that let the model produce reasoning tokens before its final answer measurably improve quality. And because everything is tokens, your costs, your latency, and even some of the model's quirks are best understood by thinking in tokens, not words.
