Every step from user request to streamed response. Click any stage to drill in.
An LLM inference request passes through 11 distinct stages grouped into three phases. Each solves a different problem, has different bottlenecks, and uses different optimisation strategies. Click any stage to explore.
LLM routing must be GPU-state-aware — traditional load balancers fail because they can't see KV cache utilization, queue depth, or model placement on each worker.
A restaurant host seating a returning customer at the same table where their appetizers are already waiting, instead of assigning a random empty seat.
If a user sends a follow-up message in a conversation, the KV cache for previous turns may still reside on a specific GPU. KV-cache aware routing directs the request back to that GPU, skipping prefill for the cached prefix entirely. This can eliminate 95% of TTFT.
Frameworks like llm-d (Kubernetes-native) and vLLM Router (Rust-based) implement this by polling each worker's cache state and using hash-based or prefix-matching lookups.
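The mechanics can be sketched in a few lines. This is a hypothetical illustration of prefix-hash routing, not the actual llm-d or vLLM Router implementation: block-aligned prefix hashes of the incoming token IDs are matched against each worker's advertised cache contents, longest prefix first.

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative size)

def prefix_hashes(token_ids):
    """Hashes of each block-aligned prefix, longest first."""
    hashes, h = [], hashlib.sha256()
    for i, tok in enumerate(token_ids, 1):
        h.update(tok.to_bytes(4, "little"))
        if i % BLOCK == 0:
            hashes.append(h.copy().hexdigest())
    return list(reversed(hashes))

def route(token_ids, worker_caches):
    """Send to the worker holding the longest cached prefix; otherwise
    fall back to the least-loaded worker (approximated here by block count)."""
    for ph in prefix_hashes(token_ids):
        for worker, cached in worker_caches.items():
            if ph in cached:
                return worker
    return min(worker_caches, key=lambda w: len(worker_caches[w]))
```

Real routers also weigh queue depth and staleness of the cache-state reports; this sketch only shows the prefix-matching core.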
The defining architecture of 2025. Prefill is compute-bound (parallel matrix ops); decode is memory-bound (sequential KV reads). Running both on the same GPU causes interference — a long prefill blocks decode iterations, spiking latency for in-progress generations.
The solution: dedicated prefill GPUs compute the initial KV cache, then transfer it (via RDMA/NVLink) to dedicated decode GPUs. Each tier scales independently. Meta, Mistral, and Hugging Face run this in production. Gains: 2–7x throughput.
vLLM Router: Rust-based, lightweight load balancer engineered for vLLM. State-aware, understands prefill/decode disaggregation patterns.
Kubernetes Gateway API Inference Extension: Model-aware routing at the K8s ingress level, supporting per-request criticalities and GPU-specific metrics.
NVIDIA Dynamo: Next-gen distributed inference framework with built-in disaggregation, dynamic GPU scheduling, and LLM-aware request routing. Up to 30x more requests served (DeepSeek-R1 on Blackwell).
Per-customer rate limits (requests/sec, tokens/min) map directly to pricing tiers. Queue vs. reject during spikes is a business decision: queuing adds latency but preserves revenue, rejecting loses it. Rate limits are pricing levers, not just technical constraints.
The base model stays resident on GPU while LoRA adapters are swapped per-request. This enables hundreds of fine-tuned variants from a single GPU deployment. Router efficiency — how well requests are matched to loaded adapters — is the most important unit economics driver.
When a model isn't already loaded on a GPU, loading weights from disk takes 30–120 seconds for large models. Hot models are load-balanced across replicas; cold models either incur this latency or require pre-warmed standby GPUs (costly idle capacity).
Prompt construction order matters — static content first, dynamic content last — to maximize prefix cache hit rates in downstream GPU stages.
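A minimal sketch of cache-friendly assembly (all names are illustrative): content identical across requests goes first, append-only history next, per-request values last, so the longest possible prefix matches the GPU-side prefix cache.

```python
def build_prompt(system_prompt, tool_schemas, history, user_message, now):
    static = system_prompt + "\n" + tool_schemas            # identical every call
    stable = "".join(f"{m['role']}: {m['text']}\n" for m in history)  # append-only
    dynamic = f"[time: {now}]\nuser: {user_message}\n"      # changes every call
    return static + "\n" + stable + dynamic                 # static-first ordering
```

Putting a timestamp or request ID at the top of the prompt would break the shared prefix for every request; the same content at the bottom costs nothing.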
A chef's mise en place — washing, chopping, and measuring all ingredients before the stove turns on. No GPU time is used here.
Optional content moderation runs before any GPU spend. A lightweight safety classifier screens requests for harmful content, preventing expensive GPU cycles on requests that would be filtered anyway. This is especially important at scale where abusive traffic can waste significant compute budget.
Larger vocabularies produce fewer tokens per input, directly reducing cost and latency — at the trade-off of a bigger embedding table.
Learning common phrases in a foreign language: 'good morning' becomes one unit instead of eleven letters, so your conversations get shorter and faster.
| Model | Year | Vocab Size |
|---|---|---|
| Llama 2 | 2023 | 32,000 |
| Llama 3 | 2024 | 128,256 |
| Mistral NeMo | 2024 | ~131,000 |
| Gemma 3 | 2025 | 262,144 |
Larger vocabularies mean fewer tokens per input (lower cost, faster inference) but larger embedding matrices. There is a log-linear relationship between vocabulary size and training loss.
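The trade-off is easy to see with a toy BPE trainer (illustrative only; real tokenizers operate on bytes and pre-tokenize words): every merge added to the vocabulary shortens the token sequence for text containing that pair.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: start from characters, repeatedly merge the most
    frequent adjacent pair into a single new token."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(a + b); i += 2   # apply the merge
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, merges
```

Running it on repetitive text shows the effect directly: more merges (a larger vocabulary) means fewer tokens for the same input.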
SentencePiece treats input as a raw character stream with no pre-tokenization, encoding spaces as the metasymbol ▁. It supports both BPE and Unigram algorithms and handles any language without language-specific preprocessing.
SuperBPE (COLM 2025): Two-pass BPE that learns cross-word "superword" tokens. Produces 33% fewer tokens and improves performance by 4.0% across 30 benchmarks.
BoundlessBPE: Relaxes word boundary constraints, achieving up to 15% improvement in bytes-per-token.
LiteToken (Feb 2026): Identifies and removes "intermediate merge residues" — tokens frequent during BPE training but rarely used in final output.
tiktoken (OpenAI, Rust core) is the fastest tokenizer at 3–6x faster than alternatives. It is inference-only (no training support) and powers OpenAI models, Llama 3+, and Mistral's Tekken tokenizer.
Tokens are the fundamental unit of cost and latency. The average number of characters per token varies dramatically by content type — English prose averages roughly 4 characters per token, while code, numbers, and non-Latin scripts often get far fewer.
Japanese users pay ~2x more per word due to tokenizer inefficiency. Different models use different tokenizers, so the same text produces different token counts. Most providers charge input and output tokens separately because output tokens (sequential decode) cost more to serve.
Embeddings give tokens meaning; positional encoding gives them order. Without position, 'dog bites man' and 'man bites dog' look identical to the model.
Giving each word a GPS coordinate in meaning-space, then stamping it with a sequence number so the model knows what came first.
[vocab_size, d_model] maps each token ID to a learned vector. For Llama 2 7B: 32,000 × 4,096 = ~131M parameters. This is a simple table index, not a matrix multiplication. The vectors encode semantic meaning learned during training.
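A minimal sketch makes the "table index, not matrix multiplication" point concrete (sizes shrunk from the Llama 2 shape of 32,000 × 4,096 for readability):

```python
# The embedding "layer" is just a row lookup into a learned table.
vocab_size, d_model = 8, 4
embedding_table = [[float(i * d_model + j) for j in range(d_model)]
                   for i in range(vocab_size)]

def embed(token_ids):
    return [embedding_table[t] for t in token_ids]  # O(1) index per token
```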
RoPE encodes position by rotating query and key vectors in 2D subspaces using sinusoidal functions. It is parameter-free, inherently captures relative positions, and scales gracefully to long contexts.
Extensions like YaRN and NTK-aware scaling allow context lengths far beyond training length. Used by LLaMA, Mistral, GPT-NeoX, and most open-weight models.
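The rotation itself is simple; here is a sketch of RoPE on a single 2-D subspace (a real model rotates d_model/2 pairs, each with a different frequency derived from the standard base of 10,000):

```python
import math

def rope_pair(x0, x1, pos, pair_idx, d_model, base=10000.0):
    """Rotate one (x0, x1) pair by an angle proportional to position."""
    freq = base ** (-2.0 * pair_idx / d_model)
    angle = pos * freq
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c
    # Key property: the dot product of a query rotated at position m with a
    # key rotated at position n depends only on the offset m - n, which is
    # how RoPE encodes relative position for free.
```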
Instead of modifying embeddings, ALiBi adds a linear bias directly to attention scores based on token distance. It shows better extrapolation beyond the training window and trains faster than RoPE. Used in MPT and some specialized models.
Google's Gemma 3N introduced PLE for mobile inference. Rather than one large initial embedding, PLE generates smaller, layer-specific embeddings cached to slower storage (flash memory) and loaded as each layer runs. This dramatically reduces active memory footprint for on-device models.
Continuous batching is the single biggest throughput optimization — it keeps GPUs busy by replacing finished sequences with new ones every iteration.
A barber who starts the next haircut immediately when a chair opens, instead of waiting for an entire group appointment to finish.
Requests are grouped into fixed-size batches. The entire batch waits until the slowest request finishes. If one request generates 10 tokens and another generates 500, the short request idles for the long one. GPU utilisation: 30-60%.
The breakthrough technique. Each sequence finishes independently and is immediately replaced with a new request at every decode iteration. The batch composition changes dynamically.
All major frameworks support it: vLLM, SGLang, TensorRT-LLM ("in-flight batching"), LMDeploy, TGI.
Long prompts are split into chunks processed iteratively, interleaved with decode steps. This prevents a single large prefill from blocking all in-progress decode iterations (head-of-line blocking). Critical for maintaining low TPOT under mixed workloads.
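The scheduling difference can be simulated in a few lines. This toy loop (not any framework's actual scheduler) admits new sequences the moment a batch slot frees, rather than waiting for the whole batch to drain:

```python
from collections import deque

def run(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate)."""
    queue = deque(requests)
    active, finished, steps = {}, [], 0
    while queue or active:
        while queue and len(active) < max_batch:   # admit work every iteration
            rid, need = queue.popleft()
            active[rid] = need
        for rid in list(active):                   # one decode iteration
            active[rid] -= 1
            if active[rid] == 0:
                finished.append(rid)
                del active[rid]                    # slot frees immediately
        steps += 1
    return finished, steps
```

With requests of 2, 10, and 2 tokens and a batch size of 2, continuous batching finishes in 10 iterations; static batching would take 12 (the batch of two waits 10 iterations for the long request, then runs the third for 2 more).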
Prefill processes all input tokens in parallel and is compute-bound. Its speed directly determines Time to First Token (TTFT).
Reading an entire exam question before writing your answer — you must process everything first, but at least you can read all words simultaneously.
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The resulting K and V tensors are stored in the KV cache.
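The formula above, written out for a single head over a cached K/V list (pure-Python sketch; production kernels batch and tile this). This is the decode-time view: one query vector attends over everything accumulated in the cache.

```python
import math

def attention(q, k_cache, v_cache):
    """softmax(q . K^T / sqrt(d_k)) V for one query over cached keys/values."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in k_cache]
    m = max(scores)                                # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, v_cache))
            for j in range(len(v_cache[0]))]
```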
When the KV cache for a prompt prefix already exists, prefill goes from O(n²) GPU compute to O(n) storage I/O, eliminating 95% of TTFT. vLLM's Automatic Prefix Caching achieves 87%+ cache hit rates with well-structured prompts and 88% faster TTFT for warm cache hits.
Apple's KV-Runahead generates KV caches for later layers in parallel while earlier layers are still processing, overlapping computation and reducing total prefill time.
Run prefill on compute-optimised hardware and decode on memory-bandwidth-optimised hardware. Prefill GPUs handle the heavy parallel matrix multiplications, then transfer the KV cache to decode GPUs via RDMA. Each tier scales independently, improving both TTFT and throughput for high-volume, latency-sensitive workloads.
The KV cache trades memory for speed — it avoids recomputing attention for previous tokens, but often consumes more GPU memory than the model weights themselves.
Keeping sticky notes of every previous conversation turn so you don't re-read the entire chat history — efficient, but your desk fills up fast.
Borrows virtual memory and paging from operating systems. GPU memory is divided into fixed-size physical blocks (e.g., 16 tokens each). Each sequence's KV cache maps to logical blocks that point to scattered physical blocks via a block table (like a page table).
Blocks are allocated on demand as tokens are generated, not pre-allocated for max length. Multiple requests sharing a prefix can point to the same physical block (copy-on-write).
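A toy block-table manager in the spirit of PagedAttention (the real vLLM engine manages GPU memory; this only shows the logical-to-physical mapping and prefix sharing):

```python
BLOCK_TOKENS = 16

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.refcount = {}
        self.tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        """Allocate a fresh physical block only when a new one is needed."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_TOKENS == 0:           # on-demand, not pre-allocated
            blk = self.free.pop()
            table.append(blk)
            self.refcount[blk] = 1

    def fork(self, parent, child):
        """Share the parent's blocks; copy-on-write would split on write."""
        self.tables[child] = list(self.tables[parent])
        for blk in self.tables[child]:
            self.refcount[blk] += 1
```

A 33-token sequence occupies exactly three 16-token blocks, and forking it shares those blocks rather than copying the KV data.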
Token-level: Evict unimportant tokens, dynamically allocate memory budget, merge similar KV pairs, quantise cached values to INT8/INT4.
Offloading: Move KV cache to CPU DRAM or disk when GPU memory is full, with intelligent prefetching.
Multi-tier storage: LMCache supports GPU DRAM → CPU DRAM → local disk → remote storage hierarchy.
DeepSeek V2/V3's MLA compresses K and V into a low-dimensional latent vector. Only the compressed latent is stored in the KV cache; at inference, it is projected back to full K/V space. Result: 93.3% KV cache reduction vs MHA, slightly outperforming in quality.
Naive stateless (default): KV cache discarded after each request. Simple but wasteful — every turn re-prefills the entire conversation history.
Prefix caching (optimised): Server keeps recent KV caches in GPU memory. When the next turn arrives with an identical prefix, only new tokens are prefilled. Classic cache eviction problem — LRU or TTL-based eviction policies decide what stays.
Routing complexity: KV cache lives on specific GPUs. The next request must route to the same GPU(s), creating session affinity and potential hot spots. If the user edits a previous message, cached KV is stale and must be invalidated.
GQA is the 2025 standard: near-full-quality attention with a fraction of the memory cost. FlashAttention makes any variant faster via hardware-aware tiling.
A study group sharing notes — instead of everyone writing independent copies (MHA), small groups share one set (GQA). Less paper, nearly the same understanding.
MHA (Multi-Head Attention): Original Transformer. Each head has independent Q, K, V. Maximum expressivity, highest KV cache cost.
MQA (Multi-Query Attention): All query heads share one K/V head. KV cache reduced by num_heads× (e.g., 32x). Lower quality.
GQA (Grouped-Query Attention): The 2025 standard. Query heads grouped, each group shares one K/V head. Example: 32 Q heads with 8 KV heads = 4x KV cache reduction. Optimal quality-efficiency balance. Used in LLaMA 3, Mistral, most open models.
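The memory arithmetic behind the three variants, using Llama-3-8B-shaped numbers (32 layers, head_dim 128, FP16):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """2x for K and V, times layers, KV heads, head dim, and dtype width."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(32, kv_heads=32, head_dim=128)  # every head has K/V
gqa = kv_bytes_per_token(32, kv_heads=8,  head_dim=128)  # 4 Q heads share one
mqa = kv_bytes_per_token(32, kv_heads=1,  head_dim=128)  # all heads share one
```

MHA costs 512 KB of KV cache per token here; GQA with 8 KV heads cuts that 4x to 128 KB, and MQA cuts it 32x — which is exactly why long contexts at scale favour fewer KV heads.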
Not a different attention variant — an IO-aware implementation that makes any attention pattern faster. Core insight: standard attention is bottlenecked by memory bandwidth, not compute.
Solution: Tile Q, K, V into blocks. Load blocks from HBM to SRAM (fast, small on-chip memory). Compute attention entirely in SRAM. Write only the final output back — never materialising the full N×N attention matrix.
FlashAttention-3 (2025): Adds async loading (overlap data transfer with compute), FP8 support, and Hopper-specific optimisations.
MLA compresses K and V into a low-dimensional latent vector before caching. At inference, the latent is projected back. In "absorb mode": 71x less memory per layer (98.6% reduction). Slightly outperforms MHA in quality.
Decode generates one token at a time and is memory-bandwidth-bound — the GPU spends most of its time waiting for data reads, not computing.
Writing a story one word at a time, where for each word you must re-read all your previous notes — the bottleneck isn't thinking, it's flipping through pages.
A small "draft model" (e.g., 1B params) generates K candidate tokens quickly. The large target model (e.g., 70B) verifies all K tokens in a single forward pass (prefill-like parallelism). Tokens are accepted left-to-right; the first rejected token is resampled.
The output distribution is mathematically identical to running the target model alone — this is lossless acceleration.
2025 advances: Block verification (5-8% additional speedup), Online Speculative Decoding (adapts draft to query distribution), Self-Speculative Decoding (uses early-exit layers, no separate model), Medusa (parallel draft heads).
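The draft-verify-accept flow can be sketched with toy stand-in models. Note this greedy version only illustrates the control flow; the lossless guarantee in practice comes from rejection sampling against the two models' probability distributions, which is omitted here.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculative round: draft k tokens cheaply, verify left-to-right,
    replace the first mismatch with the target's own token."""
    draft, ctx = [], list(prefix)
    for _ in range(k):                       # cheap sequential drafting
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in draft:                          # target verifies (in one parallel
        correct = target_next(ctx)           # pass on real hardware)
        if t == correct:
            accepted.append(t); ctx.append(t)
        else:
            accepted.append(correct)         # resample at first rejection
            break
    return accepted
```

When the draft agrees with the target, a whole round of k tokens is accepted for one target pass; when it diverges, progress still advances by at least one correct token.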
Arithmetic intensity = compute operations ÷ bytes loaded. During decode, this ratio is abysmal: the GPU loads the entire weight matrix for a single matrix-vector multiply. For 70B at FP16, that's loading ~140 GB from HBM at 3.35 TB/s → ~42ms minimum per token → ~24 tokens/sec. The GPU's ~990 TFLOPS sit mostly idle, starved for data.
Batching fixes this: loading weights once and processing 32 requests turns the vector ops into efficient matrix ops. Same bandwidth cost, 32x more useful compute.
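The roofline arithmetic above as a back-of-envelope calculation (an estimate from the stated figures, not a benchmark):

```python
# 70B params at FP16 (2 bytes each) must stream from HBM for every token.
weights_gb = 70e9 * 2 / 1e9                  # 140 GB of weight data
hbm_gbps = 3350.0                            # H100-class HBM3 bandwidth

ms_per_token = weights_gb / hbm_gbps * 1000  # ~42 ms floor per token
tokens_per_sec = 1000 / ms_per_token         # ~24 tok/s at batch size 1

# Batching amortises the weight traffic: 32 requests per load.
batched_tokens_per_sec = tokens_per_sec * 32 # ~766 aggregate tok/s
```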
For every output token, the GPU streams each layer's weights down the memory hierarchy:
| Level | Size | Speed | Role |
|---|---|---|---|
| NVMe Disk | TBs | ~7 GB/s | Weight storage (cold start) |
| CPU RAM | 100s GB | ~64 GB/s | Transit to GPU |
| HBM | 80-192 GB | 3,350 GB/s | Weights + KV cache (hot) |
| SRAM | ~50 MB | ~19 TB/s | On-chip compute cache |
SRAM is 5.7x faster than HBM but holds only ~0.036% of a 70B model. Each layer's weights must be tiled through SRAM in ~34 chunks, then evicted for the next layer. Prefetching overlaps loading Layer N+1 while computing Layer N.
The order of sampling operations matters: penalties → temperature → truncation → softmax → sample. Min-P is the recommended truncation method for 2025.
Choosing a restaurant: first eliminate closed ones (truncation), adjust for how adventurous you feel (temperature), then pick from what's left.
Temperature scales logits: p_i = exp(z_i / T) / Σexp(z_j / T)
T = 1.0: Default. T < 1.0: Sharper, more deterministic. T > 1.0: Flatter, more creative. T → 0: Greedy decoding (always pick highest probability).
Top-K: Keep K highest-probability tokens. Problem: fixed K is context-inappropriate.
Top-P (Nucleus): Keep smallest set whose cumulative probability exceeds P. Dynamically adaptive, but coupled to temperature.
Min-P (ICLR 2025): Filter tokens below min_p × max_probability. Consistently outperforms Top-P, especially at higher temperatures. The recommended truncation method for 2025.
Repetition penalty: Multiplicative penalty on logits of recently-seen tokens.
Frequency penalty: Additive penalty proportional to token occurrence count.
Presence penalty: Flat additive penalty on any token that has appeared at all (binary).
LZ penalty (2025): Information-theoretic penalty based on Lempel-Ziv complexity, detecting repeated n-gram patterns.
The newest method. Addresses temperature coupling — probability-based truncation produces identical token sets regardless of temperature. Top-n-sigma operates in logit space, keeping tokens within n standard deviations of the mean logit. Fully decoupled from temperature.
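The pipeline order described above, sketched end to end (parameter values are illustrative; min-p is applied to post-softmax probabilities, as it is defined relative to the max probability):

```python
import math, random

def sample(logits, recent_tokens, temperature=0.8, min_p=0.1,
           repetition_penalty=1.1):
    z = list(logits)
    for t in set(recent_tokens):                   # 1. repetition penalty
        z[t] = z[t] / repetition_penalty if z[t] > 0 else z[t] * repetition_penalty
    z = [x / temperature for x in z]               # 2. temperature scaling
    m = max(z)                                     # 3. stable softmax
    probs = [math.exp(x - m) for x in z]
    s = sum(probs)
    probs = [p / s for p in probs]
    cutoff = min_p * max(probs)                    # 4. min-p truncation
    probs = [p if p >= cutoff else 0.0 for p in probs]
    s = sum(probs)
    probs = [p / s for p in probs]                 #    renormalise survivors
    return random.choices(range(len(probs)), weights=probs)[0]  # 5. sample
```

With one dominant logit, min-p prunes everything else and sampling becomes effectively greedy; the repetition penalty visibly shifts mass away from recently-seen tokens.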
Uses finite state machines to mask invalid tokens at each decode step. The FSM tracks position in the output grammar (e.g., "just opened a JSON key string — only valid characters or closing quote allowed") and zeros out tokens that would produce invalid output.
Actually uses pushdown automata (FSMs with a stack) to handle nested structures like JSON braces/brackets. Performance: O(1) per token — microseconds of overhead. After filtering, remaining tokens are renormalized to sum to 1, subtly concentrating probability mass.
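A minimal illustration of the masking mechanism over a 5-token toy vocabulary that only accepts the shape `{"k":<digit>}`. Real systems compile a grammar or JSON schema into automata over the model's actual vocabulary; this hand-built FSM just shows how invalid tokens are zeroed out.

```python
VOCAB = ['{"k":', "1", "2", "}", "x"]

# state -> {allowed token id: next state}; state 3 is accepting.
FSM = {
    0: {0: 1},           # must open with '{"k":'
    1: {1: 2, 2: 2},     # then exactly one digit
    2: {3: 3},           # then the closing brace
}

def allowed_mask(state):
    return [1 if tok in FSM.get(state, {}) else 0 for tok in range(len(VOCAB))]

def constrained_greedy(logit_steps):
    """Mask invalid tokens at each step, then pick the best survivor."""
    state, out = 0, []
    for logits in logit_steps:
        masked = [(l if ok else float("-inf"))
                  for l, ok in zip(logits, allowed_mask(state))]
        tok = max(range(len(masked)), key=masked.__getitem__)
        out.append(VOCAB[tok])
        state = FSM[state][tok]
    return "".join(out), state
```

Even if the raw logits favour an invalid token at every step, the mask guarantees the output matches the grammar.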
Tokens can't always be decoded independently — partial characters must be buffered until a valid text boundary is reached before sending to the client.
A simultaneous translator who sometimes needs to hear the next few syllables before they can translate the current word — they buffer until meaning is clear.
Special tokens (e.g., <|endoftext|>, tool-call markers) must be stripped before the text reaches the client.
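The buffering logic can be sketched with a small streaming decoder (a simplification: it assumes decode errors only occur from truncated multi-byte sequences at the end of the buffer, which is the common case for well-formed token streams):

```python
class StreamDecoder:
    """Accumulate token byte-pieces; release only complete UTF-8 text."""
    def __init__(self):
        self.buf = b""

    def push(self, piece: bytes) -> str:
        self.buf += piece
        try:
            text = self.buf.decode("utf-8")
            self.buf = b""
            return text
        except UnicodeDecodeError as e:
            text = self.buf[:e.start].decode("utf-8")  # emit the complete part
            self.buf = self.buf[e.start:]              # keep the partial char
            return text
```

A two-byte character like é split across tokens stays buffered until its continuation byte arrives, instead of leaking a mojibake byte to the client.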
SSE is the standard for streaming LLM responses. Each token is packaged as an SSE message and flushed immediately:
Why SSE over WebSockets? Simpler (HTTP-based, unidirectional), built-in reconnection, works with standard infrastructure. "90% of the benefit with 10% of the headache."
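The wire format is simple enough to show directly. This sketch uses OpenAI-style `data:` lines with a JSON delta per token and a `[DONE]` sentinel; the exact payload field names vary by provider.

```python
import json

def sse_event(token: str) -> str:
    """Frame one token as an SSE message; the blank line ends the event."""
    payload = json.dumps({"choices": [{"delta": {"content": token}}]})
    return f"data: {payload}\n\n"

def sse_done() -> str:
    return "data: [DONE]\n\n"
```

Each event is flushed to the socket immediately after the token is decoded, which is what makes the response feel instant despite decode being sequential.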
After detokenization, responses pass through: stop condition checking (stop sequences, max tokens, EOS), content filtering (toxicity, safety classifiers), response scoring (reward models), format enforcement (JSON schema validation), and citation linking.
Each turn is a completely separate HTTP request. The SSE connection closes after [DONE]. The server retains no state between turns. The client assembles the full conversation history and sends the entire thing as the prompt in each new request. Every turn gets more expensive — the entire history is re-processed as input tokens (unless prefix caching is available).
SSE connections stay open 10–30 seconds for long responses. Load balancers and proxies often have idle timeouts that kill connections before generation completes. This manifests as mysterious truncated responses — a common production issue that's hard to diagnose without proper timeout configuration.
After generation, the system records: token counts (input, output, cached), latency telemetry (TTFT, TPOT, total), GPU time consumed, and model/parameters used. Revenue recognition differs for per-token vs. per-GPU-hour billing models.
These techniques aren't sequential pipeline steps — they apply across the entire forward pass to reduce memory footprint and distribute computation across GPUs.
LLM inference is memory-bandwidth bound. Smaller weights = less data to transfer = faster inference.
| Method | Bits | Best For | Quality |
|---|---|---|---|
| FP8 (E4M3) | 8 | Hopper GPUs | ~99% |
| AWQ | 4 | GPU inference | ~95% |
| GPTQ | 4 | GPU inference | ~90% |
| GGUF | 2-8 | CPU / edge | ~92% |
AWQ key insight: not all weights are equally important. It identifies salient weights by analysing activation magnitudes and skips them during quantization. Consistently outperforms GPTQ.
FP8 on Hopper GPUs is nearly lossless: 2x performance, 2x memory reduction vs FP16.
Weight matrices are split column-wise or row-wise across GPUs. Each GPU holds a fraction of every layer and computes its slice in parallel. Requires AllReduce after every layer — needs high-bandwidth interconnect (NVLink). Best within a single node.
A 32-layer model with PP=4 assigns layers 0-7 to GPU 0, 8-15 to GPU 1, etc. Communication is only between adjacent stages (point-to-point), much less frequent than TP. Pipeline bubbles (idle stages) are mitigated with micro-batching. Best for scaling across nodes.
Specialised for Mixture of Experts models (DeepSeek-V3: 256 experts, Mixtral: 8). Different experts placed on different GPUs. Requires All-to-All communication to route tokens to correct expert GPU. Since only a fraction of experts activate per token (e.g., 2/64), each GPU does less work while the total model can be massive.
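Tensor parallelism's core idea fits in a few lines. This toy sketch shards a weight matrix along its output dimension across "GPUs", computes each slice locally, and concatenates — standing in for the all-gather/all-reduce a real system performs over NVLink (real TP also alternates column and row splits per layer pair).

```python
def matvec(w, x):
    """y = W x for W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def tensor_parallel_matvec(w, x, tp):
    """Split W's output rows across tp shards, compute, then gather."""
    n = len(w) // tp
    parts = [matvec(w[g * n:(g + 1) * n], x) for g in range(tp)]  # per-GPU work
    return [y for part in parts for y in part]                    # all-gather
```

The sharded result matches the single-device result exactly; the cost of the approach is the per-layer communication, which is why TP wants NVLink-class bandwidth.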
Model weights travel through a 4-level hierarchy on every token generation. The critical bottleneck is HBM → SRAM — even HBM's 3,350 GB/s can't keep pace with GPU compute capacity.
| Level | Capacity | Bandwidth |
|---|---|---|
| NVMe SSD | TBs | ~7 GB/s |
| CPU RAM (PCIe 5.0) | 100s GB | ~64 GB/s |
| HBM3 (H100) | 80 GB | 3,350 GB/s |
| SRAM (on-chip) | ~50 MB | ~19 TB/s |
SRAM holds only 0.036% of a 70B model. Each layer's weights (~1.7 GB) are tiled through SRAM in ~34 chunks, then evicted for the next layer. Cannot keep weights on-chip between tokens.
| Factor | 8B | 405B | Penalty |
|---|---|---|---|
| Weight data/token | ~16 GB | ~810 GB | ~50x |
| GPUs required | 1 | 8-16 | 8-16x cost |
| Communication | Zero | 126 all-reduce ops | Pure penalty |
| Batch size | 64+ | 8-16 | 4-8x less throughput |
| Cost per M tokens | $0.03-0.05 | $0.50-1.00 | 10-20x |
These factors compound: 405B requires ~50x more weight data, split across 8-16 GPUs with 126 synchronisation points per token, and can only batch 8-16 requests vs 64+.
| Layer | Multiplier | Cumulative |
|---|---|---|
| Continuous Batching | ~2.5x | ~750 tok/s |
| PagedAttention | ~1.7x | ~1,275 tok/s |
| FlashAttention | ~1.3x | ~1,658 tok/s |
| Quantization (FP8) | ~1.8x | ~2,984 tok/s |
| Custom CUDA Kernels | ~1.3x | ~3,879 tok/s |
| Speculative Decoding | ~1.3x | ~5,043 tok/s |
15-17x total improvement on identical hardware. This is why the same model can have 10x+ price variance across providers. Custom CUDA kernels are the hardest to replicate — years of accumulated GPU programming expertise form the core moat of inference platforms.
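The multipliers above compound; assuming the ~300 tok/s baseline implied by the table's first row, the product reproduces the final figure (an illustration of the compounding, not a benchmark):

```python
multipliers = [2.5, 1.7, 1.3, 1.8, 1.3, 1.3]  # the stack from the table
total = 1.0
for m in multipliers:
    total *= m
final_tok_s = 300 * total   # roughly 5,000 tok/s
```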
When models are split via tensor parallelism, every forward pass requires GPUs to exchange data via all-reduce at every layer. InfiniBand provides the microsecond-scale latency and high per-link bandwidth (400 Gb/s with NDR) this demands.
RDMA + GPUDirect: GPU → InfiniBand NIC → NIC → GPU, no CPU involved. NVIDIA/Mellanox near-monopoly limits price negotiation.
Networking typically accounts for 15-25% of total cluster cost.
Ethernet alternative: 30-50% cheaper with RoCE (RDMA over Converged Ethernet), closing the gap to 2-3x latency difference. Ultra Ethernet Consortium building open standard.
Memory chips stacked 8-12 layers vertically via through-silicon vias (TSVs), placed on the same silicon package as the GPU die.
| Spec | DDR5 | HBM2e (A100) | HBM3 (H100) | HBM3e (H200) |
|---|---|---|---|---|
| Bandwidth | ~50-100 GB/s | ~2,000 GB/s | ~3,350 GB/s | ~4,800 GB/s |
| Capacity/GPU | N/A | 80 GB | 80 GB | 141 GB |
Bandwidth → token speed. Capacity → what fits on one GPU. H200 was a big deal: same compute as H100, but 70B model fits on one GPU instead of two — eliminating communication overhead entirely.
These frameworks implement the full pipeline above. Each makes different trade-offs.
Key innovations: PagedAttention, continuous batching, OpenAI-compatible API. V1 architecture (2025) features separate engine core process for scheduler + KV cache management. Largest community, best model support (~15-20 new models/week).
RadixAttention uses a radix tree to store KV cache prefixes, enabling automatic sharing across requests with partial prefix overlap. Up to 6.4x higher throughput and 3.7x lower latency than vLLM on structured workloads. Best for agents, tool chains, and RAG systems.
Graph-level optimisations, kernel fusion, in-flight batching. Fastest TTFT at low concurrency (35-50ms), but can degrade under high load. Tightly coupled to NVIDIA hardware. Best for ultra-low latency with well-supported models.
Built-in prefill-decode disaggregation, dynamic GPU scheduling, LLM-aware request routing. Up to 30x more requests (DeepSeek-R1 on Blackwell), 2x+ throughput (Llama 70B on Hopper). Dynamo Planner: SLO-driven automation solving the rate-matching challenge between prefill and decode tiers.