Every step from user request to streamed response. Click any stage to drill in.
An LLM inference request passes through 11 distinct stages grouped into three phases. Each solves a different problem, has different bottlenecks, and uses different optimisation strategies. Click any stage to explore.
LLM routing must be GPU-state-aware — traditional load balancers fail because they can't see KV cache utilization, queue depth, or model placement on each worker.
A restaurant host seating a returning customer at the same table where their appetizers are already waiting, instead of assigning a random empty seat.
If a user sends a follow-up message in a conversation, the KV cache for previous turns may still reside on a specific GPU. KV-cache aware routing directs the request back to that GPU, skipping prefill for the cached prefix entirely. This can eliminate 95% of TTFT.
Frameworks like llm-d (Kubernetes-native) and vLLM Router (Rust-based) implement this by polling each worker's cache state and using hash-based or prefix-matching lookups.
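The mechanics can be sketched in a few lines. This is a hypothetical illustration of prefix-hash routing, not the actual llm-d or vLLM Router implementation: block-aligned prefix hashes of the incoming token IDs are matched against each worker's advertised cache contents, longest prefix first.

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative size)

def prefix_hashes(token_ids):
    """Hashes of each block-aligned prefix, longest first."""
    hashes, h = [], hashlib.sha256()
    for i, tok in enumerate(token_ids, 1):
        h.update(tok.to_bytes(4, "little"))
        if i % BLOCK == 0:
            hashes.append(h.copy().hexdigest())
    return list(reversed(hashes))

def route(token_ids, worker_caches):
    """Send to the worker holding the longest cached prefix; otherwise
    fall back to the least-loaded worker (approximated here by block count)."""
    for ph in prefix_hashes(token_ids):
        for worker, cached in worker_caches.items():
            if ph in cached:
                return worker
    return min(worker_caches, key=lambda w: len(worker_caches[w]))
```

Real routers also weigh queue depth and staleness of the cache-state reports; this sketch only shows the prefix-matching core.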
The defining architecture of 2025. Prefill is compute-bound (parallel matrix ops); decode is memory-bound (sequential KV reads). Running both on the same GPU causes interference — a long prefill blocks decode iterations, spiking latency for in-progress generations.
The solution: dedicated prefill GPUs compute the initial KV cache, then transfer it (via RDMA/NVLink) to dedicated decode GPUs. Each tier scales independently. Meta, Mistral, and Hugging Face run this in production. Gains: 2–7x throughput.
vLLM Router: Rust-based, lightweight load balancer engineered for vLLM. State-aware, understands prefill/decode disaggregation patterns.
Kubernetes Gateway API Inference Extension: Model-aware routing at the K8s ingress level, supporting per-request criticalities and GPU-specific metrics.
NVIDIA Dynamo: Next-gen distributed inference framework with built-in disaggregation, dynamic GPU scheduling, and LLM-aware request routing. Up to 30x more requests served (DeepSeek-R1 on Blackwell).
Per-customer rate limits (requests/sec, tokens/min) map directly to pricing tiers. Queue vs. reject during spikes is a business decision: queuing adds latency but preserves revenue, rejecting loses it. Rate limits are pricing levers, not just technical constraints.
The base model stays resident on GPU while LoRA adapters are swapped per-request. This enables hundreds of fine-tuned variants from a single GPU deployment. Router efficiency — how well requests are matched to loaded adapters — is the most important unit economics driver.
When a model isn't already loaded on a GPU, loading weights from disk takes 30–120 seconds for large models. Hot models are load-balanced across replicas; cold models either incur this latency or require pre-warmed standby GPUs (costly idle capacity).
Prompt construction order matters — static content first, dynamic content last — to maximize prefix cache hit rates in downstream GPU stages.
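A minimal sketch of cache-friendly assembly (all names are illustrative): content identical across requests goes first, append-only history next, per-request values last, so the longest possible prefix matches the GPU-side prefix cache.

```python
def build_prompt(system_prompt, tool_schemas, history, user_message, now):
    static = system_prompt + "\n" + tool_schemas            # identical every call
    stable = "".join(f"{m['role']}: {m['text']}\n" for m in history)  # append-only
    dynamic = f"[time: {now}]\nuser: {user_message}\n"      # changes every call
    return static + "\n" + stable + dynamic                 # static-first ordering
```

Putting a timestamp or request ID at the top of the prompt would break the shared prefix for every request; the same content at the bottom costs nothing.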
A chef's mise en place — washing, chopping, and measuring all ingredients before the stove turns on. No GPU time is used here.
Optional content moderation runs before any GPU spend. A lightweight safety classifier screens requests for harmful content, preventing expensive GPU cycles on requests that would be filtered anyway. This is especially important at scale where abusive traffic can waste significant compute budget.
Larger vocabularies produce fewer tokens per input, directly reducing cost and latency — at the trade-off of a bigger embedding table.
Learning common phrases in a foreign language: 'good morning' becomes one unit instead of eleven letters, so your conversations get shorter and faster.
| Model | Year | Vocab Size |
|---|---|---|
| Llama 2 | 2023 | 32,000 |
| Llama 3 | 2024 | 128,256 |
| Mistral NeMo | 2024 | ~131,000 |
| Gemma 3 | 2025 | 262,144 |
Larger vocabularies mean fewer tokens per input (lower cost, faster inference) but larger embedding matrices. There is a log-linear relationship between vocabulary size and training loss.
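The trade-off is easy to see with a toy BPE trainer (illustrative only; real tokenizers operate on bytes and pre-tokenize words): every merge added to the vocabulary shortens the token sequence for text containing that pair.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent pair into a single new token."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b); i += 2   # apply the merge
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, merges
```

Running it on repetitive text shows the effect directly: more merges (a larger vocabulary) means fewer tokens for the same input.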
SentencePiece treats input as a raw character stream with no pre-tokenization, encoding spaces as the metasymbol ▁. It supports both BPE and Unigram algorithms and handles any language without language-specific preprocessing.
SuperBPE (COLM 2025): Two-pass BPE that learns cross-word "superword" tokens. Produces 33% fewer tokens and improves performance by 4.0% across 30 benchmarks.
BoundlessBPE: Relaxes word boundary constraints, achieving up to 15% improvement in bytes-per-token.
LiteToken (Feb 2026): Identifies and removes "intermediate merge residues" — tokens frequent during BPE training but rarely used in final output.
tiktoken (OpenAI, Rust core) is the fastest tokenizer at 3–6x faster than alternatives. It is inference-only (no training support) and powers OpenAI models, Llama 3+, and Mistral's Tekken tokenizer.
Tokens are the fundamental unit of cost and latency. The average number of characters per token varies dramatically by content type — English prose averages roughly 4 characters per token, while code, numbers, and non-Latin scripts often get far fewer.
Japanese users pay ~2x more per word due to tokenizer inefficiency. Different models use different tokenizers, so the same text produces different token counts. Most providers charge input and output tokens separately because output tokens (sequential decode) cost more to serve.
Embeddings give tokens meaning; positional encoding gives them order. Without position, 'dog bites man' and 'man bites dog' look identical to the model.
Giving each word a GPS coordinate in meaning-space, then stamping it with a sequence number so the model knows what came first.
[vocab_size, d_model] maps each token ID to a learned vector. For Llama 2 7B: 32,000 × 4,096 = ~131M parameters. This is a simple table index, not a matrix multiplication. The vectors encode semantic meaning learned during training.
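A minimal sketch makes the "table index, not matrix multiplication" point concrete (sizes shrunk from the Llama 2 shape of 32,000 × 4,096 for readability):

```python
# The embedding "layer" is just a row lookup into a learned table.
vocab_size, d_model = 8, 4
embedding_table = [[float(i * d_model + j) for j in range(d_model)]
                   for i in range(vocab_size)]

def embed(token_ids):
    return [embedding_table[t] for t in token_ids]  # O(1) index per token
```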
RoPE encodes position by rotating query and key vectors in 2D subspaces using sinusoidal functions. It is parameter-free, inherently captures relative positions, and scales gracefully to long contexts.
Extensions like YaRN and NTK-aware scaling allow context lengths far beyond training length. Used by LLaMA, Mistral, GPT-NeoX, and most open-weight models.
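The rotation itself is simple; here is a sketch of RoPE on a single 2-D subspace (a real model rotates d_model/2 pairs, each with a different frequency derived from the standard base of 10,000):

```python
import math

def rope_pair(x0, x1, pos, pair_idx, d_model, base=10000.0):
    """Rotate one (x0, x1) pair by an angle proportional to position."""
    freq = base ** (-2.0 * pair_idx / d_model)
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
    # Key property: the dot product of a query rotated at position m with a
    # key rotated at position n depends only on the offset m - n, which is
    # how RoPE encodes relative position for free.
```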
Instead of modifying embeddings, ALiBi adds a linear bias directly to attention scores based on token distance. It shows better extrapolation beyond the training window and trains faster than RoPE. Used in MPT and some specialized models.
Google's Gemma 3N introduced PLE for mobile inference. Rather than one large initial embedding, PLE generates smaller, layer-specific embeddings cached to slower storage (flash memory) and loaded as each layer runs. This dramatically reduces active memory footprint for on-device models.
Continuous batching is the single biggest throughput optimization — it keeps GPUs busy by replacing finished sequences with new ones every iteration.
A barber who starts the next haircut immediately when a chair opens, instead of waiting for an entire group appointment to finish.
Requests are grouped into fixed-size batches. The entire batch waits until the slowest request finishes. If one request generates 10 tokens and another generates 500, the short request idles for the long one. GPU utilisation: 30-60%.
The breakthrough technique. Each sequence finishes independently and is immediately replaced with a new request at every decode iteration. The batch composition changes dynamically.
All major frameworks support it: vLLM, SGLang, TensorRT-LLM ("in-flight batching"), LMDeploy, TGI.
Long prompts are split into chunks processed iteratively, interleaved with decode steps. This prevents a single large prefill from blocking all in-progress decode iterations (head-of-line blocking). Critical for maintaining low TPOT under mixed workloads.
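The scheduling difference can be simulated in a few lines. This toy loop (not any framework's actual scheduler) admits new sequences the moment a batch slot frees, rather than waiting for the whole batch to drain:

```python
from collections import deque

def run(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active, finished, steps = {}, [], 0
    while queue or active:
        while queue and len(active) < max_batch:   # admit work every iteration
            rid, need = queue.popleft()
            active[rid] = need
        for rid in list(active):                   # one decode iteration
            active[rid] -= 1
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]                    # slot frees immediately
        steps += 1
    return finished, steps
```

With requests of 2, 10, and 2 tokens and a batch size of 2, continuous batching finishes in 10 iterations; static batching would take 12 (the batch of two waits 10 iterations for the long request, then runs the third for 2 more).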
Prefill processes all input tokens in parallel and is compute-bound. Its speed directly determines Time to First Token (TTFT).
Reading an entire exam question before writing your answer — you must process everything first, but at least you can read all words simultaneously.
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The resulting K and V tensors are stored in the KV cache.
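The formula above, written out for a single head over a cached K/V list (pure-Python sketch; production kernels batch and tile this). This is the decode-time view: one query vector attends over everything accumulated in the cache.

```python
import math

def attention(q, k_cache, v_cache):
    """softmax(q . K^T / sqrt(d_k)) V for one query over cached keys/values."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in k_cache]
    m = max(scores)                                # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, v_cache))
            for j in range(len(v_cache[0]))]
```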
When the KV cache for a prompt prefix already exists, prefill goes from O(n²) GPU compute to O(n) storage I/O, eliminating 95% of TTFT. vLLM's Automatic Prefix Caching achieves 87%+ cache hit rates with well-structured prompts and 88% faster TTFT for warm cache hits.
Apple's KV-Runahead generates KV caches for later layers in parallel while earlier layers are still processing, overlapping computation and reducing total prefill time.
Run prefill on compute-optimised hardware and decode on memory-bandwidth-optimised hardware. Prefill GPUs handle the heavy parallel matrix multiplications, then transfer the KV cache to decode GPUs via RDMA. Each tier scales independently, improving both TTFT and throughput for high-volume, latency-sensitive workloads.
The KV cache trades memory for speed — it avoids recomputing attention for previous tokens, but often consumes more GPU memory than the model weights themselves.
Keeping sticky notes of every previous conversation turn so you don't re-read the entire chat history — efficient, but your desk fills up fast.
Borrows virtual memory and paging from operating systems. GPU memory is divided into fixed-size physical blocks (e.g., 16 tokens each). Each sequence's KV cache maps to logical blocks that point to scattered physical blocks via a block table (like a page table).
Blocks are allocated on demand as tokens are generated, not pre-allocated for max length. Multiple requests sharing a prefix can point to the same physical block (copy-on-write).
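A toy block-table manager in the spirit of PagedAttention (the real vLLM engine manages GPU memory; this only shows the logical-to-physical mapping and prefix sharing):

```python
BLOCK_TOKENS = 16

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.refcount = {}
        self.tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        """Allocate a fresh physical block only when a new one is needed."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_TOKENS == 0:           # on-demand, not pre-allocated
            blk = self.free.pop()
            table.append(blk)
            self.refcount[blk] = 1

    def fork(self, parent, child):
        """Share the parent's blocks; copy-on-write would split on write."""
        self.tables[child] = list(self.tables[parent])
        for blk in self.tables[child]:
            self.refcount[blk] += 1
```

A 33-token sequence occupies exactly three 16-token blocks, and forking it shares those blocks rather than copying the KV data.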
Token-level: Evict unimportant tokens, dynamically allocate memory budget, merge similar KV pairs, quantise cached values to INT8/INT4.
Offloading: Move KV cache to CPU DRAM or disk when GPU memory is full, with intelligent prefetching.
Multi-tier storage: LMCache supports GPU DRAM → CPU DRAM → local disk → remote storage hierarchy.
DeepSeek V2/V3's MLA compresses K and V into a low-dimensional latent vector. Only the compressed latent is stored in the KV cache; at inference, it is projected back to full K/V space. Result: 93.3% KV cache reduction vs MHA, slightly outperforming in quality.
Naive stateless (default): KV cache discarded after each request. Simple but wasteful — every turn re-prefills the entire conversation history.
Prefix caching (optimised): Server keeps recent KV caches in GPU memory. When the next turn arrives with an identical prefix, only new tokens are prefilled. Classic cache eviction problem — LRU or TTL-based eviction policies decide what stays.
Routing complexity: KV cache lives on specific GPUs. The next request must route to the same GPU(s), creating session affinity and potential hot spots. If the user edits a previous message, cached KV is stale and must be invalidated.
GQA is the 2025 standard: near-full-quality attention with a fraction of the memory cost. FlashAttention makes any variant faster via hardware-aware tiling.
A study group sharing notes — instead of everyone writing independent copies (MHA), small groups share one set (GQA). Less paper, nearly the same understanding.
MHA (Multi-Head Attention): Original Transformer. Each head has independent Q, K, V. Maximum expressivity, highest KV cache cost.
MQA (Multi-Query Attention): All query heads share one K/V head. KV cache reduced by num_heads× (e.g., 32x). Lower quality.
GQA (Grouped-Query Attention): The 2025 standard. Query heads grouped, each group shares one K/V head. Example: 32 Q heads with 8 KV heads = 4x KV cache reduction. Optimal quality-efficiency balance. Used in LLaMA 3, Mistral, most open models.
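The memory arithmetic behind the three variants, using Llama-3-8B-shaped numbers (32 layers, head_dim 128, FP16):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """2x for K and V, times layers, KV heads, head dim, and dtype width."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(32, kv_heads=32, head_dim=128)  # every head has K/V
gqa = kv_bytes_per_token(32, kv_heads=8,  head_dim=128)  # 4 Q heads share one
mqa = kv_bytes_per_token(32, kv_heads=1,  head_dim=128)  # all heads share one
```

MHA costs 512 KB of KV cache per token here; GQA with 8 KV heads cuts that 4x to 128 KB, and MQA cuts it 32x — which is exactly why long contexts at scale favour fewer KV heads.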
Not a different attention variant — an IO-aware implementation that makes any attention pattern faster. Core insight: standard attention is bottlenecked by memory bandwidth, not compute.
Solution: Tile Q, K, V into blocks. Load blocks from HBM to SRAM (fast, small on-chip memory). Compute attention entirely in SRAM. Write only the final output back — never materialising the full N×N attention matrix.
FlashAttention-3 (2025): Adds async loading (overlap data transfer with compute), FP8 support, and Hopper-specific optimisations.
MLA compresses K and V into a low-dimensional latent vector before caching. At inference, the latent is projected back. In "absorb mode": 71x less memory per layer (98.6% reduction). Slightly outperforms MHA in quality.
Decode generates one token at a time and is memory-bandwidth-bound — the GPU spends most of its time waiting for data reads, not computing.
Writing a story one word at a time, where for each word you must re-read all your previous notes — the bottleneck isn't thinking, it's flipping through pages.
A small "draft model" (e.g., 1B params) generates K candidate tokens quickly. The large target model (e.g., 70B) verifies all K tokens in a single forward pass (prefill-like parallelism). Tokens are accepted left-to-right; the first rejected token is resampled.
The output distribution is mathematically identical to running the target model alone — this is lossless acceleration.
2025 advances: Block verification (5-8% additional speedup), Online Speculative Decoding (adapts draft to query distribution), Self-Speculative Decoding (uses early-exit layers, no separate model), Medusa (parallel draft heads).
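The draft-verify-accept flow can be sketched with toy stand-in models. Note this greedy version only illustrates the control flow; the lossless guarantee in practice comes from rejection sampling against the two models' probability distributions, which is omitted here.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculative round: draft k tokens cheaply, verify left-to-right,
    replace the first mismatch with the target's own token."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # cheap sequential drafting
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in draft:                          # target verifies (in one parallel
        correct = target_next(ctx)           # pass on real hardware)
        if t == correct:
            accepted.append(t); ctx.append(t)
        else:
            accepted.append(correct)         # resample at first rejection
            break
    return accepted
```

When the draft agrees with the target, a whole round of k tokens is accepted for one target pass; when it diverges, progress still advances by at least one correct token.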
Arithmetic intensity = compute operations ÷ bytes loaded. During decode, this ratio is abysmal: the GPU loads the entire weight matrix for a single matrix-vector multiply. For 70B at FP16, that's loading ~140 GB from HBM at 3.35 TB/s → ~42ms minimum per token → ~24 tokens/sec. The GPU's ~990 TFLOPS sit mostly idle, starved for data.
Batching fixes this: loading weights once and processing 32 requests turns the vector ops into efficient matrix ops. Same bandwidth cost, 32x more useful compute.
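The roofline arithmetic above as a back-of-envelope calculation (an estimate from the stated figures, not a benchmark):

```python
# 70B params at FP16 (2 bytes each) must stream from HBM for every token.
weights_gb = 70e9 * 2 / 1e9                  # 140 GB of weight data
hbm_gbps = 3350.0                            # H100-class HBM3 bandwidth

ms_per_token = weights_gb / hbm_gbps * 1000  # ~42 ms floor per token
tokens_per_sec = 1000 / ms_per_token         # ~24 tok/s at batch size 1

# Batching amortises the weight traffic: 32 requests per load.
batched_tokens_per_sec = tokens_per_sec * 32 # ~766 aggregate tok/s
```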
For every output token, the GPU streams each layer's weights down the memory hierarchy:
| Level | Size | Speed | Role |
|---|---|---|---|
| NVMe Disk | TBs | ~7 GB/s | Weight storage (cold start) |
| CPU RAM | 100s GB | ~64 GB/s | Transit to GPU |
| HBM | 80-192 GB | 3,350 GB/s | Weights + KV cache (hot) |
| SRAM | ~50 MB | ~19 TB/s | On-chip compute cache |
SRAM is 5.7x faster than HBM but holds only ~0.036% of a 70B model. Each layer's weights must be tiled through SRAM in ~34 chunks, then evicted for the next layer. Prefetching overlaps loading Layer N+1 while computing Layer N.
The order of sampling operations matters: penalties → temperature → truncation → softmax → sample. Min-P is the recommended truncation method for 2025.
Choosing a restaurant: first eliminate closed ones (truncation), adjust for how adventurous you feel (temperature), then pick from what's left.
Temperature scales logits: p_i = exp(z_i / T) / Σexp(z_j / T)
T = 1.0: Default. T < 1.0: Sharper, more deterministic. T > 1.0: Flatter, more creative. T → 0: Greedy decoding (always pick highest probability).
Top-K: Keep K highest-probability tokens. Problem: fixed K is context-inappropriate.
Top-P (Nucleus): Keep smallest set whose cumulative probability exceeds P. Dynamically adaptive, but coupled to temperature.
Min-P (ICLR 2025): Filter tokens below min_p × max_probability. Consistently outperforms Top-P, especially at higher temperatures. The recommended truncation method for 2025.
Repetition penalty: Multiplicative penalty on logits of recently-seen tokens.
Frequency penalty: Additive penalty proportional to token occurrence count.
Presence penalty: Flat additive penalty on any token that has appeared at all (binary).
LZ penalty (2025): Information-theoretic penalty based on Lempel-Ziv complexity, detecting repeated n-gram patterns.
The newest method. Addresses temperature coupling — probability-based truncation produces identical token sets regardless of temperature. Top-n-sigma operates in logit space, keeping tokens within n standard deviations of the mean logit. Fully decoupled from temperature.
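The pipeline order described above, sketched end to end (parameter values are illustrative; min-p is applied to post-softmax probabilities, as it is defined relative to the max probability):

```python
import math, random

def sample(logits, recent_tokens, temperature=0.8, min_p=0.1,
           repetition_penalty=1.1):
    z = list(logits)
    for t in set(recent_tokens):                   # 1. repetition penalty
        z[t] = z[t] / repetition_penalty if z[t] > 0 else z[t] * repetition_penalty
    z = [x / temperature for x in z]               # 2. temperature scaling
    m = max(z)                                     # 3. stable softmax
    probs = [math.exp(x - m) for x in z]
    s = sum(probs)
    probs = [p / s for p in probs]
    cutoff = min_p * max(probs)                    # 4. min-p truncation
    probs = [p if p >= cutoff else 0.0 for p in probs]
    s = sum(probs)
    probs = [p / s for p in probs]                 #    renormalise survivors
    return random.choices(range(len(probs)), weights=probs)[0]  # 5. sample
```

With one dominant logit, min-p prunes everything else and sampling becomes effectively greedy; the repetition penalty visibly shifts mass away from recently-seen tokens.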
Uses finite state machines to mask invalid tokens at each decode step. The FSM tracks position in the output grammar (e.g., "just opened a JSON key string — only valid characters or closing quote allowed") and zeros out tokens that would produce invalid output.
Actually uses pushdown automata (FSMs with a stack) to handle nested structures like JSON braces/brackets. Performance: O(1) per token — microseconds of overhead. After filtering, remaining tokens are renormalized to sum to 1, subtly concentrating probability mass.
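A minimal illustration of the masking mechanism over a 5-token toy vocabulary that only accepts the shape `{"k":<digit>}`. Real systems compile a grammar or JSON schema into automata over the model's actual vocabulary; this hand-built FSM just shows how invalid tokens are zeroed out.

```python
VOCAB = ['{"k":', "1", "2", "}", "x"]

# state -> {allowed token id: next state}; state 3 is accepting.
FSM = {
    0: {0: 1},           # must open with '{"k":'
    1: {1: 2, 2: 2},     # then exactly one digit
    2: {3: 3},           # then the closing brace
}

def allowed_mask(state):
    return [1 if tok in FSM.get(state, {}) else 0 for tok in range(len(VOCAB))]

def constrained_greedy(logit_steps):
    """Mask invalid tokens at each step, then pick the best survivor."""
    state, out = 0, []
    for logits in logit_steps:
        masked = [(l if ok else float("-inf"))
                  for l, ok in zip(logits, allowed_mask(state))]
        tok = max(range(len(masked)), key=masked.__getitem__)
        out.append(VOCAB[tok])
        state = FSM[state][tok]
    return "".join(out), state
```

Even if the raw logits favour an invalid token at every step, the mask guarantees the output matches the grammar.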
Tokens can't always be decoded independently — partial characters must be buffered until a valid text boundary is reached before sending to the client.
A simultaneous translator who sometimes needs to hear the next few syllables before they can translate the current word — they buffer until meaning is clear.
Special tokens (e.g., <|endoftext|>, tool-call markers) must be stripped before the text reaches the client.
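The buffering logic can be sketched with a small streaming decoder (a simplification: it assumes decode errors only occur from truncated multi-byte sequences at the end of the buffer, which is the common case for well-formed token streams):

```python
class StreamDecoder:
    """Accumulate token byte-pieces; release only complete UTF-8 text."""
    def __init__(self):
        self.buf = b""

    def push(self, piece: bytes) -> str:
        self.buf += piece
        try:
            text = self.buf.decode("utf-8")
            self.buf = b""
            return text
        except UnicodeDecodeError as e:
            text = self.buf[:e.start].decode("utf-8")  # emit the complete part
            self.buf = self.buf[e.start:]              # keep the partial char
            return text
```

A two-byte character like é split across tokens stays buffered until its continuation byte arrives, instead of leaking a mojibake byte to the client.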
SSE is the standard for streaming LLM responses. Each token is packaged as an SSE message and flushed immediately:
Why SSE over WebSockets? Simpler (HTTP-based, unidirectional), built-in reconnection, works with standard infrastructure. "90% of the benefit with 10% of the headache."
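The wire format is simple enough to show directly. This sketch uses OpenAI-style `data:` lines with a JSON delta per token and a `[DONE]` sentinel; the exact payload field names vary by provider.

```python
import json

def sse_event(token: str) -> str:
    """Frame one token as an SSE message; the blank line ends the event."""
    payload = json.dumps({"choices": [{"delta": {"content": token}}]})
    return f"data: {payload}\n\n"

def sse_done() -> str:
    return "data: [DONE]\n\n"
```

Each event is flushed to the socket immediately after the token is decoded, which is what makes the response feel instant despite decode being sequential.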
After detokenization, responses pass through: stop condition checking (stop sequences, max tokens, EOS), content filtering (toxicity, safety classifiers), response scoring (reward models), format enforcement (JSON schema validation), and citation linking.
Each turn is a completely separate HTTP request. The SSE connection closes after [DONE]. The server retains no state between turns. The client assembles the full conversation history and sends the entire thing as the prompt in each new request. Every turn gets more expensive — the entire history is re-processed as input tokens (unless prefix caching is available).
SSE connections stay open 10–30 seconds for long responses. Load balancers and proxies often have idle timeouts that kill connections before generation completes. This manifests as mysterious truncated responses — a common production issue that's hard to diagnose without proper timeout configuration.
After generation, the system records: token counts (input, output, cached), latency telemetry (TTFT, TPOT, total), GPU time consumed, and model/parameters used. Revenue recognition differs for per-token vs. per-GPU-hour billing models.
These techniques aren't sequential pipeline steps — they apply across the entire forward pass to reduce memory footprint and distribute computation across GPUs.
LLM inference is memory-bandwidth bound. Smaller weights = less data to transfer = faster inference.
| Method | Bits | Best For | Quality |
|---|---|---|---|
| FP8 (E4M3) | 8 | Hopper GPUs | ~99% |
| AWQ | 4 | GPU inference | ~95% |
| GPTQ | 4 | GPU inference | ~90% |
| GGUF | 2-8 | CPU / edge | ~92% |
AWQ key insight: not all weights are equally important. It identifies salient weights by analysing activation magnitudes and skips them during quantization. Consistently outperforms GPTQ.
FP8 on Hopper GPUs is nearly lossless: 2x performance, 2x memory reduction vs FP16.
Weight matrices are split column-wise or row-wise across GPUs. Each GPU holds a fraction of every layer and computes its slice in parallel. Requires AllReduce after every layer — needs high-bandwidth interconnect (NVLink). Best within a single node.
A 32-layer model with PP=4 assigns layers 0-7 to GPU 0, 8-15 to GPU 1, etc. Communication is only between adjacent stages (point-to-point), much less frequent than TP. Pipeline bubbles (idle stages) are mitigated with micro-batching. Best for scaling across nodes.
Specialised for Mixture of Experts models (DeepSeek-V3: 256 experts, Mixtral: 8). Different experts placed on different GPUs. Requires All-to-All communication to route tokens to correct expert GPU. Since only a fraction of experts activate per token (e.g., 2/64), each GPU does less work while the total model can be massive.
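Tensor parallelism's core idea fits in a few lines. This toy sketch shards a weight matrix along its output dimension across "GPUs", computes each slice locally, and concatenates — standing in for the all-gather/all-reduce a real system performs over NVLink (real TP also alternates column and row splits per layer pair).

```python
def matvec(w, x):
    """y = W x for W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def tensor_parallel_matvec(w, x, tp):
    """Split W's output rows across tp shards, compute, then gather."""
    n = len(w) // tp
    parts = [matvec(w[g * n:(g + 1) * n], x) for g in range(tp)]  # per-GPU work
    return [y for part in parts for y in part]                    # all-gather
```

The sharded result matches the single-device result exactly; the cost of the approach is the per-layer communication, which is why TP wants NVLink-class bandwidth.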
Model weights travel through a 4-level hierarchy on every token generation. The critical bottleneck is HBM → SRAM — even HBM's 3,350 GB/s can't keep pace with GPU compute capacity.
| Level | Capacity | Bandwidth |
|---|---|---|
| NVMe SSD | TBs | ~7 GB/s |
| CPU RAM (PCIe 5.0) | 100s GB | ~64 GB/s |
| HBM3 (H100) | 80 GB | 3,350 GB/s |
| SRAM (on-chip) | ~50 MB | ~19 TB/s |
SRAM holds only 0.036% of a 70B model. Each layer's weights (~1.7 GB) are tiled through SRAM in ~34 chunks, then evicted for the next layer. Cannot keep weights on-chip between tokens.
| Factor | 8B | 405B | Penalty |
|---|---|---|---|
| Weight data/token | ~16 GB | ~810 GB | ~50x |
| GPUs required | 1 | 8-16 | 8-16x cost |
| Communication | Zero | 126 all-reduce ops | Pure penalty |
| Batch size | 64+ | 8-16 | 4-8x less throughput |
| Cost per M tokens | $0.03-0.05 | $0.50-1.00 | 10-20x |
These factors compound: 405B requires ~50x more weight data, split across 8-16 GPUs with 126 synchronisation points per token, and can only batch 8-16 requests vs 64+.
| Layer | Multiplier | Cumulative |
|---|---|---|
| Continuous Batching | ~2.5x | ~750 tok/s |
| PagedAttention | ~1.7x | ~1,275 tok/s |
| FlashAttention | ~1.3x | ~1,658 tok/s |
| Quantization (FP8) | ~1.8x | ~2,984 tok/s |
| Custom CUDA Kernels | ~1.3x | ~3,879 tok/s |
| Speculative Decoding | ~1.3x | ~5,043 tok/s |
15-17x total improvement on identical hardware. This is why the same model can have 10x+ price variance across providers. Custom CUDA kernels are the hardest to replicate — years of accumulated GPU programming expertise form the core moat of inference platforms.
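The multipliers above compound; assuming the ~300 tok/s baseline implied by the table's first row, the product reproduces the final figure (an illustration of the compounding, not a benchmark):

```python
multipliers = [2.5, 1.7, 1.3, 1.8, 1.3, 1.3]  # the stack from the table
total = 1.0
for m in multipliers:
    total *= m
final_tok_s = 300 * total   # roughly 5,000 tok/s
```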
When models are split via tensor parallelism, every forward pass requires GPUs to exchange data via all-reduce at every layer. InfiniBand provides the microsecond-scale latency and high per-link bandwidth (400 Gb/s with NDR) this demands.
RDMA + GPUDirect: GPU → InfiniBand NIC → NIC → GPU, no CPU involved. NVIDIA/Mellanox near-monopoly limits price negotiation.
Networking typically accounts for 15-25% of total cluster cost.
Ethernet alternative: 30-50% cheaper with RoCE (RDMA over Converged Ethernet), closing the gap to 2-3x latency difference. Ultra Ethernet Consortium building open standard.
Memory chips stacked 8-12 layers vertically via through-silicon vias (TSVs), placed on the same silicon package as the GPU die.
| Spec | DDR5 | HBM2e (A100) | HBM3 (H100) | HBM3e (H200) |
|---|---|---|---|---|
| Bandwidth | ~50-100 GB/s | ~2,000 GB/s | ~3,350 GB/s | ~4,800 GB/s |
| Capacity/GPU | N/A | 80 GB | 80 GB | 141 GB |
Bandwidth → token speed. Capacity → what fits on one GPU. H200 was a big deal: same compute as H100, but 70B model fits on one GPU instead of two — eliminating communication overhead entirely.
These frameworks implement the full pipeline above. Each makes different trade-offs.
Key innovations: PagedAttention, continuous batching, OpenAI-compatible API. V1 architecture (2025) features separate engine core process for scheduler + KV cache management. Largest community, best model support (~15-20 new models/week).
RadixAttention uses a radix tree to store KV cache prefixes, enabling automatic sharing across requests with partial prefix overlap. Up to 6.4x higher throughput and 3.7x lower latency than vLLM on structured workloads. Best for agents, tool chains, and RAG systems.
Graph-level optimisations, kernel fusion, in-flight batching. Fastest TTFT at low concurrency (35-50ms), but can degrade under high load. Tightly coupled to NVIDIA hardware. Best for ultra-low latency with well-supported models.
Built-in prefill-decode disaggregation, dynamic GPU scheduling, LLM-aware request routing. Up to 30x more requests (DeepSeek-R1 on Blackwell), 2x+ throughput (Llama 70B on Hopper). Dynamo Planner: SLO-driven automation solving the rate-matching challenge between prefill and decode tiers.