Inference & serving

The surprising thing about running a large model is that you usually run out of memory long before you run out of compute. Generating text is a sequential, memory-bound loop, and the thing eating your GPU is not the weights — it is the KV cache, the model's running memory of the conversation. Understand that one structure and the rest of serving — batching, paging, quantization, the latency-throughput tug of war — falls into place.

Generation is a sequential loop

As covered in how LLMs work, a model produces one token, appends it, and runs again. That sequential dependency is the root of every serving challenge. You cannot generate token 50 until you have token 49, so a single request cannot be parallelised across its own output. The GPU, which is built to do tens of thousands of multiplications at once, spends each decode step largely waiting on memory rather than maxing out its arithmetic units. This is the central, counter-intuitive fact of LLM serving: the hardware is usually starved for data, not for compute.

Why memory-bound? Each decode step has to read the model's weights out of memory to multiply them against a single new token's worth of activations. That is a lot of reading for very little arithmetic, so the step finishes as soon as the weights have been streamed in — the multiply-add units sit idle waiting. The implication runs through everything below: the way to use a GPU well on this workload is to give it more useful work to do per byte of memory it reads, which is exactly what batching does, and to read fewer bytes in the first place, which is exactly what quantization does.
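The memory-bound argument above can be put into a rough back-of-envelope calculation. This is a minimal sketch, not a benchmark: the function name, the hardware figures (2 TB/s of bandwidth, 400 TFLOP/s of peak compute), and the 14 GB model are illustrative assumptions, and the estimate ignores KV cache reads and kernel overheads.

```python
def decode_step_times(weight_bytes, mem_bandwidth_bytes_s, peak_flops,
                      batch_size=1, bytes_per_weight=2):
    """Return (memory_time_s, compute_time_s) for one decode step.

    Assumes every weight byte is read once per step and that each
    parameter costs about 2 FLOPs (one multiply, one add) per token
    in the batch. KV cache traffic is deliberately ignored.
    """
    n_params = weight_bytes / bytes_per_weight
    memory_time = weight_bytes / mem_bandwidth_bytes_s
    compute_time = 2 * n_params * batch_size / peak_flops
    return memory_time, compute_time


# Hypothetical card and model: 14 GB of FP16 weights,
# 2 TB/s memory bandwidth, 400 TFLOP/s peak arithmetic.
mem_t, comp_t = decode_step_times(14e9, 2e12, 400e12, batch_size=1)
print(f"memory: {mem_t * 1e3:.2f} ms, compute: {comp_t * 1e3:.3f} ms")
```

Under these made-up numbers, streaming the weights takes about 200 times longer than the arithmetic at batch size one, so the batch would have to grow to roughly 200 tokens per step before compute became the limit.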

Two phases: prefill and decode

Serving splits into two very different phases with opposite shapes. Prefill processes your whole prompt at once to warm up the model's state. All the prompt tokens are available simultaneously, so the GPU can crunch them in parallel; prefill is compute-heavy and fast per token. Decode then emits the answer one token at a time, each step depending on the last; it is memory-bandwidth-bound and slow per token. A long prompt with a short answer is dominated by prefill; a short prompt with a long answer is dominated by decode.

The two phases have opposite shapes, which is why modern stacks increasingly schedule them separately.

The two phases differ enough that newer systems schedule them apart — sometimes on different GPU pools — so that a burst of long prompts doing prefill does not stall the steady drip of decode steps for everyone else. The two user-facing numbers map directly onto the phases: time to first token is dominated by prefill and by how long the request waited in the queue, and inter-token latency is the pace of the stream during decode.
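How the two user-facing numbers combine into end-to-end latency can be sketched with a simple model. The function and its parameters are hypothetical names for illustration, and the model assumes a constant prefill rate and a constant inter-token latency, which real servers only approximate.

```python
def request_latency(queue_s, prompt_tokens, output_tokens,
                    prefill_tokens_per_s, inter_token_s):
    """Return (time_to_first_token_s, total_latency_s) for one request.

    Time to first token is queue wait plus prefill; the first output
    token arrives at the end of prefill, and each remaining token
    costs one inter-token interval.
    """
    ttft = queue_s + prompt_tokens / prefill_tokens_per_s
    total = ttft + max(output_tokens - 1, 0) * inter_token_s
    return ttft, total


# Long prompt, short answer: prefill and queueing dominate.
print(request_latency(0.1, 2000, 50, 10_000, 0.02))
# Short prompt, long answer: decode dominates.
print(request_latency(0.1, 100, 1000, 10_000, 0.02))
```

With these illustrative rates, the first request spends most of its time to first token in prefill, while the second spends nearly all of its roughly 20 seconds in decode.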

The KV cache: the model's working memory

Recall that attention compares each new token against every earlier token using their key and value vectors. Recomputing the keys and values for the entire history on every single step would be hopelessly wasteful, so the system computes them once per token and keeps them. That store is the KV cache. With it, each decode step computes keys and values only for the one new token and then attends over the stored history, so the per-step cost grows only with the size of the cache that has to be read, instead of repeating the full forward pass over every earlier token.

The catch is size. The cache holds a key and a value vector for every token, in every attention layer, for every request in flight. It grows linearly with sequence length and with batch size, and on a busy server it routinely consumes more memory than the model's weights themselves. A common rule of thumb is to reserve roughly half of GPU memory for the KV cache. This is the precise sense in which serving is a memory problem: how many requests you can run at once is set by how much KV cache fits, not by how fast the GPU multiplies. When people say a deployment is "out of memory," they almost always mean it ran out of room for KV cache and had to start queueing or rejecting requests.
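The size of the cache follows directly from the description above: one key and one value vector per token, per layer, per request. A minimal sketch of that arithmetic follows. The model shape used here (32 layers, 32 key/value heads, head dimension 128, FP16 values) is a hypothetical example, not a statement about any particular model.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 counts the key vector and the value vector.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(seq_len, batch_size, **model_shape):
    """Total KV cache for batch_size requests of seq_len tokens each."""
    return seq_len * batch_size * kv_bytes_per_token(**model_shape)


# Hypothetical model shape, chosen only to make the numbers round.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)
per_token = kv_bytes_per_token(**shape)
total = kv_cache_bytes(seq_len=4096, batch_size=16, **shape)
print(per_token // 1024, "KiB per token")
print(total // 2**30, "GiB for 16 requests of 4096 tokens")
```

For this assumed shape the cache costs 512 KiB per token, and sixteen requests at 4,096 tokens each need 32 GiB, which shows how quickly the cache can rival the weights.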

Why context length is expensive. Doubling the prompt does two bad things at once: prefill attention is quadratic in length, so its attention work roughly quadruples, and the KV cache it leaves behind is twice as large, so it crowds out other requests for the entire generation. Long context is not one cost, it is two: slower to ingest and heavier to hold.

Batching: filling a starved GPU

Because each decode step reads the full weights to serve a single token, running one request at a time wastes almost all the GPU. The fix is to batch: run many requests together so that one read of the weights serves many tokens at once. Batching is the lever that turns a memory-bound, under-utilised GPU into a productive one, and it is why throughput and the number of concurrent users you can serve rise together.

Naive (static) batching groups a fixed set of requests and runs them in lockstep, but it has an ugly failure mode: the whole batch is held hostage by its longest request, and a slot that finishes early sits idle until the rest catch up. Continuous batching fixes this by working at the level of individual decode steps. Every iteration, the scheduler retires finished requests and admits new ones, so a request that completes frees its slot immediately and a newly arrived request joins on the next step.

Static batching wastes the slots that finish early; continuous batching admits a new request the moment one frees up.

This single change can multiply GPU utilisation several times over on mixed traffic, and it is the headline feature of every modern serving engine. It is also why benchmarks quoted as "tokens per second" can be wildly misleading without a batch size attached — the same GPU and model can differ by an order of magnitude between batch-of-one and a full continuous batch.
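The difference between the two batching policies can be seen in a toy simulation that counts decode steps. This is a sketch under simplifying assumptions: every request's output length is known up front, each step costs the same regardless of how many slots are filled, and prefill is ignored. The function names are illustrative only.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Decode steps when fixed batches run in lockstep to their longest member."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(lengths, batch_size):
    """Decode steps when finished requests are replaced on every iteration."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the step.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        # Retire finished requests so their slots free up immediately.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Mixed traffic: two long requests among six short ones, four slots.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_steps(lengths, 4), continuous_batch_steps(lengths, 4))
```

On this workload the static policy needs 200 steps because each batch waits for its 100-token request, while the continuous policy finishes in 110 by starting the second long request as soon as a short one completes.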

PagedAttention and the memory analogy

The KV cache used to be stored as one contiguous block per request, sized for the worst case. That wastes memory the way fixed-size partitions waste a disk: a short reply reserves space it never uses, and fragmentation means you cannot fit a new request even though the free bytes exist somewhere. The fix, pioneered by vLLM, is PagedAttention, which manages the KV cache in small fixed-size pages with a lookup table — exactly the way an operating system manages virtual memory. Pages are allocated on demand, internal fragmentation nearly vanishes, and identical prompt prefixes can even share the same physical pages.

That last point — prefix sharing — is quietly important. If a thousand requests all begin with the same long system prompt, paging lets them share one copy of that prefix's KV cache instead of storing a thousand near-identical copies. The payoff across the board is many more concurrent requests on the same card, which is why PagedAttention reset the baseline for what a single GPU can serve.
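The paging idea can be illustrated with a toy allocator that tracks only the bookkeeping: a page table per sequence, a free list, and reference counts for shared pages. This is a simplified sketch and not vLLM's implementation. The class and method names are hypothetical, and it sidesteps copy-on-write by sharing only completely filled pages.

```python
class PagedKVAllocator:
    """Toy page-table allocator for KV cache pages (bookkeeping only)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.ref_counts = {}    # physical page -> sequences using it
        self.page_tables = {}   # sequence id -> ordered physical pages
        self.lengths = {}       # sequence id -> tokens stored so far

    def add_sequence(self, seq_id):
        self.page_tables[seq_id] = []
        self.lengths[seq_id] = 0

    def append_token(self, seq_id):
        # Allocate on demand: a new page only when the last one is full.
        if self.lengths[seq_id] % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("KV cache is full")
            page = self.free_pages.pop()
            self.ref_counts[page] = 1
            self.page_tables[seq_id].append(page)
        self.lengths[seq_id] += 1

    def fork(self, parent_id, child_id):
        # Share only the parent's completely filled pages, so neither
        # sequence ever writes into a page the other can see.
        full = self.lengths[parent_id] // self.page_size
        shared = self.page_tables[parent_id][:full]
        for page in shared:
            self.ref_counts[page] += 1
        self.page_tables[child_id] = list(shared)
        self.lengths[child_id] = full * self.page_size

    def free(self, seq_id):
        # Return a page to the free list once no sequence references it.
        for page in self.page_tables.pop(seq_id):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                del self.ref_counts[page]
                self.free_pages.append(page)
        del self.lengths[seq_id]


alloc = PagedKVAllocator(num_pages=8, page_size=4)
alloc.add_sequence("system")
for _ in range(8):                # an 8-token shared prefix fills 2 pages
    alloc.append_token("system")
for user in ("a", "b"):
    alloc.fork("system", user)    # sharing allocates no new pages
    alloc.append_token(user)      # first private token takes 1 new page
pages_in_use = 8 - len(alloc.free_pages)
print(pages_in_use)
```

Here two requests that share an eight-token prefix use four pages in total rather than the six they would need with private copies, and the shared pages are only released when the last sequence referencing them is freed.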

Quantization: read fewer bytes

Since decode is memory-bound, the other big lever is simply to make the numbers smaller. Quantization stores the weights, and often the KV cache, at lower precision: FP8 or INT4 instead of the usual FP16 or BF16. Smaller numbers mean fewer bytes to stream out of memory per step, which speeds up the memory-bound decode and also frees room for more KV cache, so you serve more users and serve each of them faster. The cost is an accuracy hit that is usually small but has to be measured.

The art is keeping that hit small. Naively rounding every weight degrades quality, so modern methods are selective. Activation-aware approaches such as AWQ use activation statistics to identify the small fraction of weight channels that matter most to the output and protect them, by rescaling those channels before quantization so they lose less to rounding, while still storing every weight at low precision. The practical upshot for 2026 is that INT4 weight quantization is a standard production choice for many models, often with little visible quality loss, and FP8 is common where hardware supports it. The right move is to quantize, then measure on your own evals rather than trusting a headline number, because the acceptable precision depends on the task.
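To make the rounding concrete, here is a minimal sketch of the naive baseline: symmetric round-to-nearest INT4 with one scale per small group of weights. It is deliberately not AWQ or any other selective method. The function names and the group size are assumptions for illustration, and real kernels pack two 4-bit codes per byte rather than keeping a Python list.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 with one scale per group.

    Returns (codes, scales): integer codes in [-8, 7] and the
    per-group scale needed to map them back to floats.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4(codes, scales, group_size=4):
    """Reconstruct approximate weights from codes and group scales."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]
codes, scales = quantize_int4(weights)
restored = dequantize_int4(codes, scales)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(worst_error, 4))
```

Each weight is recovered to within half a quantization step of its group, so a group containing one large outlier gets a coarse step for all of its members. That is the weakness the selective methods are designed to soften.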

Speculative decoding and going multi-GPU

Two more techniques show up once the basics are in place. Speculative decoding attacks the sequential bottleneck directly: a small, fast "draft" model proposes several tokens ahead, and the large model verifies them in a single parallel step, accepting the run up to the first disagreement. When the draft is often right, the big model effectively emits several tokens per step instead of one, cutting latency. With the standard acceptance rule the output distribution is unchanged, so the speedup costs no quality.
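The propose-then-verify loop can be sketched for the simplest case, greedy decoding, where "disagreement" means the draft's token differs from the token the large model would have picked. This is an illustrative sketch with hypothetical names: `draft_next` and `target_next` are stand-in callables that return the next token for a context, and the verification loop stands in for what a real engine does in one batched forward pass. Sampling-based decoding uses a probabilistic acceptance rule that is not shown here.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed = []
    context = list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposed.append(token)
        context.append(token)

    # 2. The target model checks every proposed position.
    accepted = []
    context = list(prefix)
    for token in proposed:
        expected = target_next(context)
        if token != expected:
            # First disagreement: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)

    # 3. Every proposal matched, so the target's pass yields a bonus token.
    accepted.append(target_next(context))
    return accepted


# Toy models over digits: the target counts upward modulo 10,
# and the draft agrees except that it gets the token after 3 wrong.
def target_next(context):
    return (context[-1] + 1) % 10


def draft_next(context):
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step([0], draft_next, target_next, k=4))
print(speculative_step([5], draft_next, target_next, k=4))
```

In both calls the appended tokens are exactly what the target model alone would have produced, but they arrive from one verification round instead of four or five separate decode steps.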

When a model is too large for one GPU, or you want lower latency than one GPU can give, you split it across several. Tensor parallelism splits each layer's matrices across GPUs that work on one token together, lowering latency at the cost of heavy communication that needs a fast interconnect. Pipeline parallelism puts different layers on different GPUs and streams batches through them like a factory line, raising throughput. Real deployments combine these, and the choice is governed by how big the model is and whether you are optimising for latency or for total tokens served.
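The tensor-parallel split can be shown on a single matrix-vector product. This is a pure-Python sketch with hypothetical function names: each "device" is just a loop iteration holding a slice of the weight matrix's columns, and concatenating the partial outputs stands in for the gather step that would cross the interconnect on real hardware.

```python
def matvec(x, weights):
    """Compute x @ weights, with weights given as rows of length d_out."""
    d_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(d_out)]


def column_parallel_matvec(x, weights, n_devices):
    """Split the output columns across devices, then gather the pieces.

    Assumes the number of columns divides evenly by n_devices.
    """
    d_out = len(weights[0])
    shard = d_out // n_devices
    gathered = []
    for device in range(n_devices):
        columns = range(device * shard, (device + 1) * shard)
        # Each device stores and multiplies only its own columns.
        local_weights = [[row[j] for j in columns] for row in weights]
        gathered.extend(matvec(x, local_weights))
    return gathered


x = [1, 2]
weights = [[1, 2, 3, 4],
           [5, 6, 7, 8]]
print(matvec(x, weights))
print(column_parallel_matvec(x, weights, n_devices=2))
```

Both calls give the same result. Each device held half of the weights and did half of the arithmetic, and the price is the gather at the end of every sharded layer.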

The trade you cannot escape

Latency and throughput pull against each other on a shared GPU. Bigger batches use the hardware better and raise total tokens-per-second across all users (throughput), but each individual user waits behind more work, so their tokens arrive slower (latency). There is no setting that maximises both; you choose a point on the curve. A consumer chat product leans toward latency and keeps batches modest so replies feel snappy; a bulk document-processing job leans toward throughput and runs the largest batches that fit, because no human is watching any single stream.

Metric	What it measures	Driven by
Time to first token (TTFT)	Wait before the stream starts	queue time + prefill
Inter-token latency (ITL)	Pace of the stream	decode + batch size
Throughput	Total tokens/sec across all users	batch size + utilisation
Cost per million tokens	What it costs to run	GPU price ÷ throughput
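The last row of the table is simple enough to compute directly. This sketch uses a hypothetical function name and made-up prices, and it assumes the GPU sustains the given throughput for the whole hour.

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second):
    """GPU price divided by throughput, expressed per million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Same hypothetical $2/hour GPU at two throughput levels.
print(round(cost_per_million_tokens(2.0, 1000), 3))  # well-batched
print(round(cost_per_million_tokens(2.0, 100), 3))   # poorly batched
```

A tenfold drop in throughput raises the unit cost tenfold on the same hardware, which is why batch size and utilisation sit directly behind the cost figure.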

Streaming changes the UX math

One reason LLM products feel fast despite slow decode is streaming. Rather than wait for the whole answer, the server sends each token to the client as it is produced, so the user starts reading immediately. This makes time to first token the metric that dominates perceived speed: a response that starts in 200 ms and streams for four seconds feels far better than one that appears all at once after three. It also reshapes how you tune. Because the user reads at human speed, an inter-token latency comfortably faster than reading pace is "fast enough," which frees you to push batch sizes higher for throughput without hurting the experience. The engineering implication is to optimise aggressively for first-token latency — keep prefill quick and queues short — and to treat the decode pace as needing only to clear the reading-speed bar, not to be minimised.

Where requests wait: queueing and admission

A GPU can only hold so much KV cache, so when more requests arrive than fit, the extras wait. How that waiting is handled is a real design choice with user-visible consequences. A naive "accept everything" policy lets the queue grow without bound under load, and time to first token balloons until the system feels broken even though it is technically still serving. Mature deployments apply admission control: cap the queue, shed or reject load past a threshold, and return a clear "busy, retry" rather than an answer that arrives a minute late.

This is the same queueing reality that governs any saturated service — the queueing-theory intuition that latency climbs sharply as utilisation approaches 100% applies directly here. The practical levers are to autoscale the GPU pool against queue depth rather than raw utilisation, to set per-tenant limits so one heavy user cannot starve the rest, and to separate latency-sensitive interactive traffic from bulk batch jobs so the two do not contend for the same slots. Treat the GPU pool as a capacity-constrained queue, and the failure modes become predictable instead of mysterious.
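The admission policy described above, with a capped queue and explicit rejection, can be sketched as a small state machine. The class name, method names, and limits are hypothetical, and a production system would add timeouts, per-tenant limits, and priorities.

```python
from collections import deque


class AdmissionController:
    """Bounded queue in front of a fixed number of in-flight slots."""

    def __init__(self, max_in_flight, max_queue):
        self.max_in_flight = max_in_flight
        self.max_queue = max_queue
        self.running = set()
        self.queue = deque()

    def submit(self, request_id):
        """Return 'running', 'queued', or 'rejected' for a new request."""
        if len(self.running) < self.max_in_flight:
            self.running.add(request_id)
            return "running"
        if len(self.queue) < self.max_queue:
            self.queue.append(request_id)
            return "queued"
        # Shed load past the cap: the caller should reply "busy, retry".
        return "rejected"

    def finish(self, request_id):
        """Free a slot and promote the oldest queued request, if any."""
        self.running.discard(request_id)
        if self.queue and len(self.running) < self.max_in_flight:
            promoted = self.queue.popleft()
            self.running.add(promoted)
            return promoted
        return None


controller = AdmissionController(max_in_flight=2, max_queue=1)
outcomes = [controller.submit(name) for name in ("a", "b", "c", "d")]
print(outcomes)
print(controller.finish("a"))
```

The fourth request is turned away at once instead of waiting behind an unbounded queue, and finishing a request promotes the queued one into the freed slot.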

Self-host or call an API

The first serving decision most teams face is whether to run a model at all. A hosted inference API removes everything above — no GPUs, no batching, no paging, no capacity planning — and you pay per token. That is the right default for getting started, for spiky or low volume, and for frontier models you could not run yourself. Self-hosting earns its keep when volume is high enough that per-token pricing dwarfs hardware cost, when data residency or privacy rules forbid sending text to a third party, when you need a fine-tuned or open model the APIs do not offer, or when you need latency and behaviour you fully control.

The honest framing is a crossover point. At low and medium volume, an API is cheaper all-in once you count the engineering time that self-hosting consumes. Past some throughput, which keeps falling as open models and serving engines improve, running your own GPUs wins on unit cost, provided you can keep them busy: an idle GPU you are still paying for is the most expensive way to serve nothing. Many teams run both: an API for burst and for the largest models, self-hosted open models for steady, high-volume, or sensitive workloads.
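The crossover can be estimated by asking at what monthly volume the API bill equals the fixed cost of self-hosting. This is a rough sketch with hypothetical names and invented prices. It treats self-hosting as a flat monthly cost and ignores whether the hardware can actually sustain the break-even volume.

```python
def breakeven_tokens_per_month(api_price_per_million, gpu_cost_per_month,
                               engineering_cost_per_month=0.0):
    """Monthly token volume at which API spend equals self-hosting cost."""
    fixed_cost = gpu_cost_per_month + engineering_cost_per_month
    return fixed_cost / api_price_per_million * 1_000_000


# Invented figures: $1 per million API tokens, $1,500 per month of GPU,
# $3,500 per month of engineering time attributed to running it.
hardware_only = breakeven_tokens_per_month(1.0, 1500)
all_in = breakeven_tokens_per_month(1.0, 1500, 3500)
print(f"{hardware_only:.2e} vs {all_in:.2e} tokens per month")
```

Counting engineering time moves the break-even from 1.5 billion to 5 billion tokens per month in this example, which is why the all-in framing matters.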

The serving stack in 2026

Engine	Known for
vLLM	The widely deployed open-source default; introduced PagedAttention and popularised continuous batching
TensorRT-LLM	Highest raw throughput on NVIDIA hardware; heavier to set up
SGLang	Strong prefix caching and structured-output performance

Whichever you run, the levers are the same: quantize to fit more in memory, batch continuously to fill the GPU, cache shared prefixes, split across GPUs when the model demands it, and pick GPUs sized to your model rather than the biggest available. Spot or preemptible instances can cut cost substantially for throughput-oriented, restartable work. The constraint to keep in the front of your mind through all of it is memory.

A capacity sketch

Put numbers on it to make the trade concrete. Suppose a card has 80 GB of memory and the model weights take 40 GB after INT4 quantization. That leaves roughly 40 GB for KV cache. If each token of context costs, say, a few hundred kilobytes of KV across all layers, then the product of (requests in flight) × (their context lengths) is capped by that 40 GB. Halve the per-token KV cost by quantizing the cache and you double the concurrency; let average context lengths grow and you cut it. This back-of-envelope is the whole game: every serving decision is really a move in the budget of "how much KV cache fits, and how busy can I keep the GPU within it." Get a feel for those two quantities for your model and hardware, and capacity planning stops being guesswork.
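The same back-of-envelope can be written as a short function. The 80 GB card and 40 GB of weights come from the paragraph above. The 320 KB per token and the 4,096-token average context are illustrative assumptions standing in for "a few hundred kilobytes", and the function name is hypothetical.

```python
def max_concurrent_requests(gpu_mem_gb, weights_gb, kv_kb_per_token,
                            avg_context_tokens, overhead_gb=0.0):
    """Requests that fit in the KV budget left after weights and overhead."""
    kv_budget_bytes = (gpu_mem_gb - weights_gb - overhead_gb) * 1e9
    bytes_per_request = kv_kb_per_token * 1e3 * avg_context_tokens
    return int(kv_budget_bytes // bytes_per_request)


baseline = max_concurrent_requests(80, 40, 320, 4096)
quantized_cache = max_concurrent_requests(80, 40, 160, 4096)
longer_context = max_concurrent_requests(80, 40, 320, 8192)
print(baseline, quantized_cache, longer_context)
```

Under these assumptions the card holds 30 requests at baseline, 61 once the per-token cache cost is halved, and 15 when the average context doubles.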

Putting it together

Stand back and the whole discipline is one idea applied repeatedly: a GPU running an LLM is starved for memory bandwidth and short on memory capacity, so every technique either feeds it more useful work per byte read or shrinks the bytes. Continuous batching feeds it more work. PagedAttention and prefix sharing reclaim wasted capacity. Quantization shrinks the bytes. Speculative decoding squeezes more tokens out of each memory-bound step. Tensor and pipeline parallelism spread a model too big for one card. None of these are tricks bolted on the side; they are all responses to the same physics.

For an engineer choosing or operating a deployment, that gives a clear order of operations: pick a model and quantization that fit your quality bar, run a serving engine that does continuous batching and paging by default, size GPUs by the memory the model plus a healthy KV budget needs, and only then reach for multi-GPU or speculative decoding if latency still falls short. Measure time to first token and inter-token latency under realistic concurrency, not single requests, and watch cost per million tokens as the number that ties it all together. Do that and "serving an LLM" stops being a dark art and becomes ordinary capacity engineering with an unusual bottleneck.

Cold starts and warm pools

A detail that surprises teams moving from stateless web services: model servers are expensive to start. Loading tens of gigabytes of weights from storage into GPU memory and initialising the runtime can take from tens of seconds to minutes, so you cannot scale these the way you scale a cheap container that boots in a second. Scaling to zero between requests sounds thrifty until the first user after an idle period waits a minute for the model to load. The usual answer is a warm pool: keep a minimum number of replicas always loaded, and scale the pool up ahead of demand using queue depth and traffic trends rather than reacting after latency has already spiked.

This interacts with cost. Because an idle loaded GPU still bills, the goal is to hold just enough warm capacity to absorb normal variation while letting genuine peaks queue briefly or spill to an API. Techniques like faster weight loading, snapshotting GPU memory, and sharing a base model across many fine-tuned adapters all exist to soften the cold-start tax, but the architectural point stands: provision for warmth, scale on leading signals, and never assume an LLM replica can appear on demand the way a normal microservice can.
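A warm-pool sizing rule driven by a leading signal might look like the sketch below. It is only one plausible policy: the function name, the use of queue depth plus in-flight requests as the demand signal, and the per-replica capacity are all assumptions for illustration.

```python
import math


def desired_replicas(queue_depth, in_flight, per_replica_capacity,
                     min_warm, max_replicas):
    """Replica count that covers current demand, bounded by the pool limits.

    min_warm keeps a floor of loaded replicas so the pool never
    scales to zero; max_replicas caps spend, so excess demand
    queues or spills elsewhere.
    """
    demand = queue_depth + in_flight
    needed = math.ceil(demand / per_replica_capacity)
    return max(min_warm, min(max_replicas, needed))


print(desired_replicas(0, 0, 16, min_warm=2, max_replicas=10))     # idle
print(desired_replicas(40, 32, 16, min_warm=2, max_replicas=10))   # busy
print(desired_replicas(1000, 0, 16, min_warm=2, max_replicas=10))  # peak
```

The idle case stays at the warm floor of two replicas, the busy case asks for five, and the peak is capped at ten, leaving the remainder to queue briefly or spill to an API.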
