Interactive Technical Reference

Anatomy of LLM Inference

Every step from user request to streamed response. Click any stage to drill in.

A
Request Preparation
Getting the input ready for the GPU
01 Request Routing 02 Preprocessing 03 Tokenization 04 Embedding & Position
B
GPU Computation
The core forward pass
05 Scheduling & Batching 06 Prefill Phase 07 KV Cache 08 Attention Mechanisms 09 Decode Phase
C
Output & Delivery
Turning logits into a response
10 Sampling & Selection 11 Detokenization & Streaming
The Request Lifecycle

Pipeline Stages

An LLM inference request passes through 11 distinct stages grouped into three phases. Each solves a different problem, has different bottlenecks, and uses different optimisation strategies. Click any stage to explore.

A
Request Preparation
Getting the input ready for the GPU
01 Request Routing Network +
The API gateway receives the request, and an LLM-aware load balancer routes it to the optimal GPU worker based on KV cache state, queue depth, and model/LoRA affinity.
Key Takeaway

LLM routing must be GPU-state-aware — traditional load balancers fail because they can't see KV cache utilization, queue depth, or model placement on each worker.

Think of it like...

A restaurant host seating a returning customer at the same table where their appetizers are already waiting, instead of assigning a random empty seat.

Why it matters
Unlike typical web requests, LLM inference is long-running and stateful (due to KV cache). Standard HTTP load balancers fail because they lack awareness of GPU state. Modern LLM-aware routers consider KV cache utilization, queue length, and LoRA adapter presence on each worker.
Drill into specifics
KV-Cache Aware Routing +
Route to GPUs that already hold relevant context

If a user sends a follow-up message in a conversation, the KV cache for previous turns may still reside on a specific GPU. KV-cache aware routing directs the request back to that GPU, skipping prefill for the cached prefix entirely. This can eliminate 95% of TTFT.

Frameworks like llm-d (Kubernetes-native) and vLLM Router (Rust-based) implement this by polling each worker's cache state and using hash-based or prefix-matching lookups.
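
To make the idea concrete, here is a minimal Python sketch of prefix-aware worker selection under simple assumptions: prompts are hashed in fixed-size token blocks and each worker reports which block hashes it has cached. The names (Worker, choose_worker, BLOCK) are hypothetical, not llm-d or vLLM Router internals.

import hashlib
from dataclasses import dataclass, field

BLOCK = 16  # tokens per cache block (hypothetical granularity)

def block_hashes(token_ids):
    """Rolling hashes of successive token-block prefixes."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(bytes(str(token_ids[i:i + BLOCK]), "utf-8"))
        hashes.append(h.hexdigest())
    return hashes

@dataclass
class Worker:
    name: str
    queue_depth: int
    cached_blocks: set = field(default_factory=set)  # block hashes reported by the worker

def choose_worker(workers, token_ids):
    """Prefer the worker holding the longest cached prefix; break ties by queue depth."""
    prefix = block_hashes(token_ids)
    def score(w):
        hits = 0
        for h in prefix:                 # prefix must match contiguously from the start
            if h in w.cached_blocks:
                hits += 1
            else:
                break
        return (hits, -w.queue_depth)
    return max(workers, key=score)

A follow-up turn in a conversation re-scores to the worker that already holds the earlier turns' blocks, which is exactly the "same table, appetizers waiting" effect described above.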

Prefill-Decode Disaggregation +
Separate GPU pools for prefill vs decode

The defining architecture of 2025. Prefill is compute-bound (parallel matrix ops); decode is memory-bound (sequential KV reads). Running both on the same GPU causes interference — a long prefill blocks decode iterations, spiking latency for in-progress generations.

The solution: dedicated prefill GPUs compute the initial KV cache, then transfer it (via RDMA/NVLink) to dedicated decode GPUs. Each tier scales independently. Meta, Mistral, and Hugging Face run this in production. Gains: 2–7x throughput.

Gateway Frameworks (2025) +
vLLM Router, K8s Gateway API, NVIDIA Dynamo

vLLM Router: Rust-based, lightweight load balancer engineered for vLLM. State-aware, understands prefill/decode disaggregation patterns.

Kubernetes Gateway API Inference Extension: Model-aware routing at the K8s ingress level, supporting per-request criticalities and GPU-specific metrics.

NVIDIA Dynamo: Next-gen distributed inference framework with built-in disaggregation, dynamic GPU scheduling, and LLM-aware request routing. Up to 30x more requests served (DeepSeek-R1 on Blackwell).

02 Preprocessing Logic +
Input validation, prompt template assembly, RAG retrieval, rate limiting. CPU-bound, scales linearly with input length.
Key Takeaway

Prompt construction order matters — static content first, dynamic content last — to maximize prefix cache hit rates in downstream GPU stages.

Think of it like...

A chef's mise en place — washing, chopping, and measuring all ingredients before the stove turns on. No GPU time is used here.

What happens here
Before any GPU work begins, the API server assembles the final prompt on CPU. This includes applying prompt templates (system instructions, chat formatting), performing RAG retrieval (if applicable), validating input constraints (max tokens, stop sequences), and rate limiting. The request metadata (ID, sampling params, timestamp) is prepared for the scheduler.
Key considerations
This stage is entirely CPU-bound and scales with input length. For RAG-heavy workloads, embedding generation and vector search can dominate preprocessing time. Smart prompt construction — placing static content first and dynamic content last — is critical for maximising prefix cache hit rates downstream.
03 Tokenization Logic +
Raw text is converted into integer token IDs via BPE or SentencePiece. Vocabulary sizes have grown 8x in 3 years (32K → 262K).
Key Takeaway

Larger vocabularies produce fewer tokens per input, directly reducing cost and latency — at the trade-off of a bigger embedding table.

Think of it like...

Learning common phrases in a foreign language: 'good morning' becomes one unit instead of eleven letters, so your conversations get shorter and faster.

How BPE works
Byte Pair Encoding starts with individual bytes (256 base tokens) and iteratively merges the most frequent adjacent pair into a new token. This repeats until the desired vocabulary size is reached. Modern LLMs use byte-level BPE, meaning any input — regardless of language or special characters — can be tokenized with zero unknown tokens.
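
A toy sketch of that training loop (real tokenizers operate on the 256 byte values and are heavily optimised; the function name and corpus here are illustrative only):

from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE: start from single characters and repeatedly merge the most frequent pair."""
    seq = list(corpus)                       # real byte-level BPE starts from 256 byte tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):                  # replace every occurrence of the winning pair
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, tokens = train_bpe("low lower lowest low low", num_merges=6)
print(merges)   # learns frequent multi-character units such as 'lo' and 'low'
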
Drill into specifics
Vocabulary Size Trends +
From 32K to 262K in 3 years
Model          Year   Vocab Size
Llama 2        2023   32,000
Llama 3        2024   128,256
Mistral Nemo   2024   ~131,000
Gemma 3        2025   262,144

Larger vocabularies mean fewer tokens per input (lower cost, faster inference) but larger embedding matrices. There is a log-linear relationship between vocabulary size and training loss.

SentencePiece +
Language-agnostic tokenization without pre-tokenization

SentencePiece treats input as a raw character stream with no pre-tokenization, encoding spaces as the metasymbol ▁ (U+2581). It supports both BPE and Unigram algorithms and handles any language without language-specific preprocessing.

2025 Innovations +
SuperBPE, BoundlessBPE, LiteToken

SuperBPE (COLM 2025): Two-pass BPE that learns cross-word "superword" tokens. Produces 33% fewer tokens and improves performance by 4.0% across 30 benchmarks.

BoundlessBPE: Relaxes word boundary constraints, achieving up to 15% improvement in bytes-per-token.

LiteToken (Feb 2026): Identifies and removes "intermediate merge residues" — tokens frequent during BPE training but rarely used in final output.

Fast Tokenizers +
tiktoken is 3-6x faster than alternatives

tiktoken (OpenAI, Rust core) is the fastest tokenizer at 3–6x faster than alternatives. It is inference-only (no training support) and powers OpenAI models, Llama 3+, and Mistral's Tekken tokenizer.
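
A quick usage sketch, assuming tiktoken is installed (pip install tiktoken); cl100k_base is the GPT-4-era encoding name:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # BPE vocabulary used by GPT-4-class models
text = "Larger vocabularies mean fewer tokens."
ids = enc.encode(text)
print(ids)                                       # list of integer token IDs
print(enc.decode(ids))                           # round-trips back to the original text
print(len(ids), "tokens for", len(text), "characters")
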

04 Embedding & Positional Encoding Memory +
Token IDs are mapped to dense vectors via a lookup table, then combined with positional information (RoPE) so the model knows token order.
Key Takeaway

Embeddings give tokens meaning; positional encoding gives them order. Without position, 'dog bites man' and 'man bites dog' look identical to the model.

Think of it like...

Giving each word a GPS coordinate in meaning-space, then stamping it with a sequence number so the model knows what came first.

Token embedding
A lookup table of shape [vocab_size, d_model] maps each token ID to a learned vector. For Llama 2 7B: 32,000 × 4,096 = ~131M parameters. This is a simple table index, not a matrix multiplication. The vectors encode semantic meaning learned during training.
Drill into specifics
RoPE (Rotary Position Embedding) +
The dominant positional encoding in 2025

RoPE encodes position by rotating query and key vectors in 2D subspaces using sinusoidal functions. It is parameter-free, inherently captures relative positions, and scales gracefully to long contexts.

Extensions like YaRN and NTK-aware scaling allow context lengths far beyond training length. Used by LLaMA, Mistral, GPT-NeoX, and most open-weight models.
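
A minimal numpy sketch of the rotation itself, assuming the standard 10000^(-2i/d) frequencies and the pairing of dimension i with dimension i + d/2 (implementations differ on the pairing convention; batching and multiple heads are omitted):

import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape [seq_len, head_dim] (head_dim even)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)       # one frequency per 2D subspace
    angles = np.outer(np.arange(seq_len), freqs)    # [seq_len, half] rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]               # pair dimension i with dimension i + half
    return np.concatenate([x1 * cos - x2 * sin,     # 2D rotation of each pair
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)                          # 8 positions, head_dim 64
q_rot = rope(q)
# Key property: dot products between rotated queries and keys depend only on position offsets.
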

ALiBi +
Attention with Linear Biases

Instead of modifying embeddings, ALiBi adds a linear bias directly to attention scores based on token distance. It shows better extrapolation beyond the training window and trains faster than RoPE. Used in MPT and some specialized models.

Per-Layer Embeddings (PLE) +
Google Gemma 3N innovation for mobile

Google's Gemma 3N introduced PLE for mobile inference. Rather than one large initial embedding, PLE generates smaller, layer-specific embeddings cached to slower storage (flash memory) and loaded as each layer runs. This dramatically reduces active memory footprint for on-device models.

B
GPU Computation
The core forward pass
05 Scheduling & Batching Logic +
The scheduler decides which requests enter the next GPU iteration. Continuous batching replaces completed sequences instantly, achieving 80-95% GPU utilisation.
Key Takeaway

Continuous batching is the single biggest throughput optimization — it keeps GPUs busy by replacing finished sequences with new ones every iteration.

Think of it like...

A barber who starts the next haircut immediately when a chair opens, instead of waiting for an entire group appointment to finish.

Why batching is critical
Each decode step underutilises the GPU — the bottleneck is reading model weights from memory, not computation. Batching amortises this cost: reading weights once and applying them to many concurrent requests. Without batching, GPU utilisation can be as low as 5-10% during decode.
Drill into specifics
Static Batching +
Fixed groups, wait for slowest request

Requests are grouped into fixed-size batches. The entire batch waits until the slowest request finishes. If one request generates 10 tokens and another generates 500, the short request idles for the long one. GPU utilisation: 30-60%.

Continuous Batching +
Replace finished sequences every iteration

The breakthrough technique. Each sequence finishes independently and is immediately replaced with a new request at every decode iteration. The batch composition changes dynamically.

80-95%
GPU Utilisation
2-8x
Throughput Gain

All major frameworks support it: vLLM, SGLang, TensorRT-LLM ("in-flight batching"), LMDeploy, TGI.
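
A toy, self-contained sketch of the scheduling loop (the Seq class and one-token-per-step bookkeeping stand in for a real engine's prefill, decode kernels, and KV block management):

import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Seq:
    req_id: int
    remaining: int            # tokens left to generate (stands in for reaching EOS)
    def finished(self):
        return self.remaining <= 0

def continuous_batching_loop(waiting: deque, max_batch: int = 4):
    """Each iteration: top up the batch, take one decode step, retire finished sequences."""
    running, step = [], 0
    while waiting or running:
        # Admit new requests into free slots every iteration -- no waiting for the batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())    # real engine: run prefill, allocate KV blocks
        for seq in running:
            seq.remaining -= 1                   # one decode iteration = one token per sequence
        for seq in [s for s in running if s.finished()]:
            print(f"step {step}: request {seq.req_id} done, slot freed")   # real engine frees KV blocks
        running = [s for s in running if not s.finished()]
        step += 1

requests = deque(Seq(i, random.randint(3, 12)) for i in range(10))
continuous_batching_loop(requests)
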

Chunked Prefill +
Interleave prefill with decode steps

Long prompts are split into chunks processed iteratively, interleaved with decode steps. This prevents a single large prefill from blocking all in-progress decode iterations (head-of-line blocking). Critical for maintaining low TPOT under mixed workloads.

06 Prefill Phase Compute +
All input tokens are processed through every Transformer layer in parallel. Computes and stores the KV cache. Determines Time to First Token (TTFT).
Key Takeaway

Prefill processes all input tokens in parallel and is compute-bound. Its speed directly determines Time to First Token (TTFT).

Think of it like...

Reading an entire exam question before writing your answer — you must process everything first, but at least you can read all words simultaneously.

How it works
All input token embeddings (with positional encoding) are fed through the Transformer layers simultaneously. At each attention layer, the model computes Query (Q), Key (K), and Value (V) matrices for every token. Self-attention is computed: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The resulting K and V tensors are stored in the KV cache.
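
The same computation as a minimal numpy sketch for a single head (causal mask included; batching, multiple heads, and the MLP blocks are omitted, and the weights are random placeholders):

import numpy as np

def prefill(x, Wq, Wk, Wv):
    """Single-head prefill: all n input vectors go through one batched matmul per projection."""
    n, d = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                      # computed for every token at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # [n, n] -- the O(n^2) term
    scores += np.triu(np.full((n, n), -1e9), k=1)         # causal mask: token i sees only tokens <= i
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                 # row-wise softmax
    out = probs @ V
    return out, (K, V)                                     # K, V go into the KV cache for decode

d = 64
x = np.random.randn(10, d)                                 # 10 prompt tokens
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out, kv_cache = prefill(x, Wq, Wk, Wv)
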
Characteristics
Compute
Bottleneck Type
O(n²)
Attention Complexity
TTFT
Latency Metric
Prefill is compute-bound and highly parallelisable. All tokens are known upfront, so matrix multiplications are fully batched — ideal for GPU utilisation. For a 10K-token prompt, prefill on a single GPU can take seconds. This time directly determines the Time to First Token.
Drill into specifics
Prefix Cache Hit +
Skip prefill entirely for cached prefixes

When the KV cache for a prompt prefix already exists, prefill goes from O(n²) GPU compute to O(n) storage I/O, eliminating 95% of TTFT. vLLM's Automatic Prefix Caching achieves 87%+ cache hit rates with well-structured prompts and 88% faster TTFT for warm cache hits.

KV-Runahead +
Apple's parallel layer prefill (2025)

Apple's KV-Runahead generates KV caches for later layers in parallel while earlier layers are still processing, overlapping computation and reducing total prefill time.

07 KV Cache & Memory Management Memory +
Stores Key and Value tensors from all layers to avoid O(n²) recomputation. Often consumes more GPU memory than the model weights. PagedAttention reduces waste from 60-80% to under 4%.
Key Takeaway

The KV cache trades memory for speed — it avoids recomputing attention for previous tokens, but often consumes more GPU memory than the model weights themselves.

Think of it like...

Keeping sticky notes of every previous conversation turn so you don't re-read the entire chat history — efficient, but your desk fills up fast.

Memory formula
// Per token
KV_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_param

// Example: Llama 2 7B (FP16)
// 2 × 32 × 32 × 128 × 2 = 524 KB per token
// 4096 ctx × 524 KB = ~2 GB per sequence
// Batch of 32 = ~64 GB — can exceed a single GPU!
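
The same arithmetic as a small Python helper; the call below reproduces the Llama 2 7B example (parameter names are descriptive only):

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_param,
                   context_len, batch_size):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x dtype bytes, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_param
    per_seq = per_token * context_len
    return per_token, per_seq, per_seq * batch_size

per_token, per_seq, total = kv_cache_bytes(
    num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_param=2,   # Llama 2 7B, FP16
    context_len=4096, batch_size=32)
print(f"{per_token/1e3:.0f} KB per token, {per_seq/2**30:.1f} GiB per sequence, "
      f"{total/2**30:.0f} GiB for the batch")
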
Drill into specifics
PagedAttention +
OS-style virtual memory for KV cache

Borrows virtual memory and paging from operating systems. GPU memory is divided into fixed-size physical blocks (e.g., 16 tokens each). Each sequence's KV cache maps to logical blocks that point to scattered physical blocks via a block table (like a page table).

Blocks are allocated on demand as tokens are generated, not pre-allocated for max length. Multiple requests sharing a prefix can point to the same physical block (copy-on-write).

<4%
Memory Waste
2-4x
More Concurrent Reqs
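
A toy sketch of the block-table bookkeeping (on-demand allocation, prefix sharing via reference counts); the class and method names are illustrative, not vLLM's implementation, and copy-on-write on divergence is noted but not implemented:

BLOCK_SIZE = 16   # tokens per physical block

class BlockManager:
    """Toy page-table bookkeeping: logical blocks per sequence -> shared physical blocks."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.refcount = {}                      # physical block -> number of sequences using it
        self.block_table = {}                   # seq_id -> list of physical block ids

    def append_token(self, seq_id, seq_len):
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_table.setdefault(seq_id, [])
        if seq_len % BLOCK_SIZE == 1 or not table:     # first token of a new block
            block = self.free.pop()
            table.append(block)
            self.refcount[block] = 1

    def fork(self, parent_id, child_id):
        """Prefix sharing: the child points at the parent's blocks (copy-on-write on divergence)."""
        self.block_table[child_id] = list(self.block_table[parent_id])
        for block in self.block_table[parent_id]:
            self.refcount[block] += 1

    def release(self, seq_id):
        for block in self.block_table.pop(seq_id, []):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                self.free.append(block)         # physical memory returned to the pool

mgr = BlockManager(num_physical_blocks=1024)
for t in range(1, 35):
    mgr.append_token("req-1", t)                # 34 tokens -> 3 blocks of 16
mgr.fork("req-1", "req-2")                      # follow-up request shares all 3 blocks
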
KV Cache Compression +
Eviction, quantization, merging

Token-level: Evict unimportant tokens, dynamically allocate memory budget, merge similar KV pairs, quantise cached values to INT8/INT4.

Offloading: Move KV cache to CPU DRAM or disk when GPU memory is full, with intelligent prefetching.

Multi-tier storage: LMCache supports GPU DRAM → CPU DRAM → local disk → remote storage hierarchy.

Multi-Head Latent Attention +
DeepSeek's 93% KV cache reduction

DeepSeek V2/V3's MLA compresses K and V into a low-dimensional latent vector. Only the compressed latent is stored in the KV cache; at inference, it is projected back to full K/V space. Result: 93.3% KV cache reduction vs MHA, slightly outperforming in quality.

08 Attention Mechanisms Compute +
The core computation. GQA is the 2025 standard. FlashAttention makes it IO-efficient by tiling through GPU SRAM (19 TB/s) instead of HBM (2 TB/s).
Key Takeaway

GQA is the 2025 standard: near-full-quality attention with a fraction of the memory cost. FlashAttention makes any variant faster via hardware-aware tiling.

Think of it like...

A study group sharing notes — instead of everyone writing independent copies (MHA), small groups share one set (GQA). Less paper, nearly the same understanding.

Drill into specifics
MHA → GQA → MQA +
The evolution of attention head sharing

MHA (Multi-Head Attention): Original Transformer. Each head has independent Q, K, V. Maximum expressivity, highest KV cache cost.

MQA (Multi-Query Attention): All query heads share one K/V head. KV cache reduced by num_heads× (e.g., 32x). Lower quality.

GQA (Grouped-Query Attention): The 2025 standard. Query heads grouped, each group shares one K/V head. Example: 32 Q heads with 8 KV heads = 4x KV cache reduction. Optimal quality-efficiency balance. Used in LLaMA 3, Mistral, most open models.
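
The grouping arithmetic as a numpy sketch (shapes only, random placeholder tensors): 32 query heads share 8 cached K/V heads by repeating each K/V head across its group of 4 query heads.

import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 256
group = n_q_heads // n_kv_heads                      # 4 query heads per KV head

Q = np.random.randn(n_q_heads, seq, head_dim)
K = np.random.randn(n_kv_heads, seq, head_dim)       # only 8 K/V heads are cached
V = np.random.randn(n_kv_heads, seq, head_dim)

K_expanded = np.repeat(K, group, axis=0)             # [32, seq, head_dim] for the matmul
V_expanded = np.repeat(V, group, axis=0)
scores = Q @ K_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
print("KV cache holds", n_kv_heads, "heads instead of", n_q_heads,
      "->", n_q_heads // n_kv_heads, "x smaller")    # 4x reduction, as in LLaMA 3 / Mistral
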

FlashAttention +
IO-aware tiled attention: up to 7.6x faster

Not a different attention variant — an IO-aware implementation that makes any attention pattern faster. Core insight: standard attention is bottlenecked by memory bandwidth, not compute.

Solution: Tile Q, K, V into blocks. Load blocks from HBM to SRAM (fast, small on-chip memory). Compute attention entirely in SRAM. Write only the final output back — never materialising the full N×N attention matrix.

19 TB/s
SRAM Bandwidth
2 TB/s
HBM Bandwidth
7.6x
Speedup (GPT-2)

FlashAttention-3 (2025): Adds async loading (overlap data transfer with compute), FP8 support, and Hopper-specific optimisations.
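
A numpy sketch of the core online-softmax idea (one head, no causal mask): K/V are processed block by block with a running max and running sum, so the full N×N score matrix never exists. This illustrates the algorithm, not the fused CUDA kernel.

import numpy as np

def streaming_attention(Q, K, V, block=128):
    """Attention computed block-by-block over K/V with a numerically stable running softmax."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)               # running max of scores per query row
    row_sum = np.zeros(n)                       # running softmax denominator per query row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)               # scores for this block only: [n, block]
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)  # rescale previously accumulated results
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

Q, K, V = np.random.randn(64, 32), np.random.randn(512, 32), np.random.randn(512, 32)
s = Q @ K.T / np.sqrt(32)
p = np.exp(s - s.max(1, keepdims=True))
ref = (p / p.sum(1, keepdims=True)) @ V
assert np.allclose(streaming_attention(Q, K, V), ref)   # matches standard attention exactly
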

Multi-Head Latent Attention +
DeepSeek's compressed attention

MLA compresses K and V into a low-dimensional latent vector before caching. At inference, the latent is projected back. In "absorb mode": 71x less memory per layer (98.6% reduction). Slightly outperforms MHA in quality.

09 Decode Phase Memory +
Tokens generated one at a time, autoregressively. Each step reads the entire KV cache but does little math. Memory-bandwidth-bound, inherently sequential.
Key Takeaway

Decode generates one token at a time and is memory-bandwidth-bound — the GPU spends most of its time waiting for data reads, not computing.

Think of it like...

Writing a story one word at a time, where for each word you must re-read all your previous notes — the bottleneck isn't thinking, it's flipping through pages.

The autoregressive loop
For each new token: the single embedding is fed through all layers. Only the Query for the new token is computed fresh. The Key and Value are computed and appended to the KV cache. Attention is computed between the new Q and all cached K/V pairs. Output logits represent probability over the vocabulary.
Why it's slow
Each decode step involves reading the entire KV cache from GPU HBM but performs little arithmetic. The arithmetic intensity is very low — the GPU is mostly waiting for memory reads. This is why decode is memory-bandwidth-bound, not compute-bound. The key metric is TPOT (Time Per Output Token).
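
Continuing the single-head numpy sketch from the prefill stage (random placeholder weights and cache): each step computes Q/K/V for one new token, appends K and V to the cache, and attends over everything cached so far.

import numpy as np

def decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv):
    """One autoregressive step: a single token attends to the entire KV cache."""
    q = x_new @ Wq                                        # fresh Q for the new token only
    k, v = x_new @ Wk, x_new @ Wv
    K_cache = np.vstack([K_cache, k])                     # cache grows by one row
    V_cache = np.vstack([V_cache, v])
    scores = q @ K_cache.T / np.sqrt(K_cache.shape[-1])   # reads the *entire* cache: memory-bound
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    out = probs @ V_cache                                 # tiny matmul: arithmetic intensity is low
    return out, K_cache, V_cache

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = np.random.randn(10, d), np.random.randn(10, d)   # left over from prefill
out, K_cache, V_cache = decode_step(np.random.randn(d), K_cache, V_cache, Wq, Wk, Wv)
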
Drill into specifics
Speculative Decoding +
Draft model + verification: 2-3x speedup, lossless

A small "draft model" (e.g., 1B params) generates K candidate tokens quickly. The large target model (e.g., 70B) verifies all K tokens in a single forward pass (prefill-like parallelism). Tokens are accepted left-to-right; the first rejected token is resampled.

The output distribution is mathematically identical to running the target model alone — this is lossless acceleration.

2-3x
Typical Speedup
0%
Quality Loss

2025 advances: Block verification (5-8% additional speedup), Online Speculative Decoding (adapts draft to query distribution), Self-Speculative Decoding (uses early-exit layers, no separate model), Medusa (parallel draft heads).
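
A toy sketch of the draft/verify control flow. For readability it uses greedy exact-match acceptance; the lossless guarantee quoted above comes from the rejection-sampling rule in the speculative decoding papers, which this sketch does not implement. Both "models" are deterministic stand-in functions.

def target_next(ctx):
    """Toy 'large' model: greedy next token as a deterministic function of the context."""
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    """Toy 'draft' model: agrees with the target most of the time, but not always."""
    guess = target_next(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 100

def speculative_decode(ctx, target_passes, k=4):
    """Draft proposes k tokens; one target pass verifies them; accept left-to-right until a mismatch."""
    for _ in range(target_passes):
        draft = []
        for _ in range(k):                                    # k cheap, sequential draft steps
            draft.append(draft_next(ctx + draft))
        verified = [target_next(ctx + draft[:i]) for i in range(k)]   # one parallel target pass
        accepted = []
        for d, t in zip(draft, verified):
            accepted.append(t)                                # the target's token is always valid output
            if d != t:                                        # first rejection ends the accepted run
                break
        else:
            accepted.append(target_next(ctx + draft))         # all k accepted: bonus token for free
        ctx = ctx + accepted                                  # 1 to k+1 tokens per target pass
    return ctx

out = speculative_decode([1, 2, 3], target_passes=10)
print(len(out) - 3, "tokens generated in 10 target forward passes")   # > 10 when the draft is good
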

C
Output & Delivery
Turning logits into a response
10 Sampling & Token Selection Logic +
Logits are transformed via temperature, truncated (top-k, top-p, min-p), penalised for repetition, then sampled. Order matters.
Key Takeaway

The order of sampling operations matters: penalties → temperature → truncation → softmax → sample. Min-P is the recommended truncation method for 2025.

Think of it like...

Choosing a restaurant: first eliminate closed ones (truncation), adjust for how adventurous you feel (temperature), then pick from what's left.

The sampling pipeline
Raw Logits → Repetition/Penalty Adjustments → Temperature Scaling → Truncation (top-k / top-p / min-p) → Softmax → Random Sample
Drill into specifics
Temperature +
Scales logits before softmax

Temperature scales logits: p_i = exp(z_i / T) / Σexp(z_j / T)

T = 1.0: Default. T < 1.0: Sharper, more deterministic. T > 1.0: Flatter, more creative. T → 0: Greedy decoding (always pick highest probability).

Top-K, Top-P, Min-P +
Truncation strategies compared

Top-K: Keep K highest-probability tokens. Problem: fixed K is context-inappropriate.

Top-P (Nucleus): Keep smallest set whose cumulative probability exceeds P. Dynamically adaptive, but coupled to temperature.

Min-P (ICLR 2025): Filter tokens below min_p × max_probability. Consistently outperforms Top-P, especially at higher temperatures. The recommended truncation method for 2025.

Repetition Penalties +
Frequency, presence, and LZ penalties

Repetition penalty: Multiplicative penalty on logits of recently-seen tokens.

Frequency penalty: Additive penalty proportional to token occurrence count.

Presence penalty: Flat additive penalty on any token that has appeared at all (binary).

LZ penalty (2025): Information-theoretic penalty based on Lempel-Ziv complexity, detecting repeated n-gram patterns.
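
The full order of operations as one numpy function (presence and frequency penalties, then temperature, then min-p, then sample). This is a sketch of the common convention; real serving APIs differ in defaults and details, and all parameter values below are illustrative.

import numpy as np

def sample_token(logits, generated, temperature=0.8, min_p=0.05,
                 presence_penalty=0.3, frequency_penalty=0.2, rng=None):
    """Penalties -> temperature -> min-p truncation -> softmax -> sample."""
    rng = rng or np.random.default_rng()
    logits = logits.copy()
    counts = np.bincount(generated, minlength=logits.shape[0])
    logits -= presence_penalty * (counts > 0)          # flat penalty for any token already used
    logits -= frequency_penalty * counts               # grows with how often it was used
    logits /= max(temperature, 1e-5)                   # T < 1 sharpens, T > 1 flattens
    # Min-p is defined on probabilities, so softmax comes first here; filtering and
    # renormalising is equivalent to truncating and re-applying softmax over the kept set.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[probs < min_p * probs.max()] = 0.0           # min-p: threshold scales with the top token
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.random.randn(50_000)                       # one row of raw logits over the vocabulary
next_id = sample_token(logits, generated=np.array([42, 42, 7]))
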

Top-n-Sigma (ACL 2025) +
Temperature-decoupled truncation in logit space

The newest method. Addresses temperature coupling: probability-based truncation (top-p, min-p) keeps a different token set as temperature changes, admitting more noise tokens at high temperatures. Top-n-sigma instead operates in logit space, keeping tokens whose logits fall within n standard deviations of the maximum logit, so the kept set is fully decoupled from temperature.

11 Detokenization & Streaming IO +
Token IDs are converted back to text and streamed to the client via Server-Sent Events (SSE). Includes content filtering and format enforcement.
Key Takeaway

Tokens can't always be decoded independently — partial characters must be buffered until a valid text boundary is reached before sending to the client.

Think of it like...

A simultaneous translator who sometimes needs to hear the next few syllables before they can translate the current word — they buffer until meaning is clear.

Detokenization subtleties
During streaming, tokens cannot be detokenized independently. Some tokens represent partial UTF-8 characters or subwords that only form valid text when combined. The detokenizer must buffer tokens until a valid text boundary is reached. Special tokens (<|endoftext|>, tool-call markers) must be stripped.
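
A sketch of that buffering logic under a simplifying assumption: each token decodes to a chunk of raw UTF-8 bytes (real engines track per-sequence detokenizer state on top of this). Only complete characters are released to the client.

def stream_text(token_byte_chunks):
    """Yield only complete UTF-8 text; buffer trailing bytes of partially received characters."""
    buffer = b""
    for chunk in token_byte_chunks:                   # each chunk: raw bytes a token decodes to
        buffer += chunk
        try:
            text = buffer.decode("utf-8")
            buffer = b""
            yield text                                # clean boundary: flush everything
        except UnicodeDecodeError as e:
            text = buffer[:e.start].decode("utf-8")   # decode the valid prefix ...
            buffer = buffer[e.start:]                 # ... keep the partial character buffered
            if text:
                yield text

emoji = "🙂".encode("utf-8")                          # 4 bytes, split across two "tokens" below
chunks = [b"Hel", b"lo ", emoji[:2], emoji[2:], b"!"]
print("".join(stream_text(chunks)))                   # -> Hello 🙂!
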
Drill into specifics
Server-Sent Events (SSE) +
The dominant streaming protocol

SSE is the standard for streaming LLM responses. Each token is packaged as an SSE message and flushed immediately:

data: {"choices":[{"delta":{"content":" Hello"}}]} data: {"choices":[{"delta":{"content":" world"}}]} data: [DONE]

Why SSE over WebSockets? Simpler (HTTP-based, unidirectional), built-in reconnection, works with standard infrastructure. "90% of the benefit with 10% of the headache."

Postprocessing Pipeline +
Safety, formatting, stop conditions

After detokenization, responses pass through: stop condition checking (stop sequences, max tokens, EOS), content filtering (toxicity, safety classifiers), response scoring (reward models), format enforcement (JSON schema validation), and citation linking.

Applied Across All Phases

Cross-Cutting Optimizations

These techniques aren't sequential pipeline steps — they apply across the entire forward pass to reduce memory footprint and distribute computation across GPUs.

Quantization Methods +
GPTQ, AWQ, GGUF, FP8

LLM inference is memory-bandwidth bound. Smaller weights = less data to transfer = faster inference.

Method       Bits   Best For        Quality
FP8 (E4M3)   8      Hopper GPUs     ~99%
AWQ          4      GPU inference   ~95%
GPTQ         4      GPU inference   ~90%
GGUF         2-8    CPU / edge      ~92%

AWQ key insight: not all weights are equally important. It identifies salient weights by analysing activation magnitudes and protects them with per-channel scaling, so they lose less precision during quantization. Consistently outperforms GPTQ.

FP8 on Hopper GPUs is nearly lossless: 2x performance, 2x memory reduction vs FP16.
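
To make the memory arithmetic concrete, here is a sketch of the simplest scheme, per-channel absmax INT8 weight quantization (AWQ and GPTQ add activation-aware scaling and error compensation on top of ideas like this):

import numpy as np

def quantize_int8(W):
    """Per-output-channel absmax quantization: one FP16 scale per row, INT8 weights."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0     # one scale per output channel
    Wq = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return Wq, scale.astype(np.float16)

def dequantize(Wq, scale):
    return Wq.astype(np.float32) * scale.astype(np.float32)

W = np.random.randn(4096, 4096).astype(np.float32)
Wq, scale = quantize_int8(W)
print("fp16:", W.size * 2 // 2**20, "MiB  ->  int8:", Wq.nbytes // 2**20, "MiB")   # 32 -> 16 MiB
print("max abs error:", np.abs(W - dequantize(Wq, scale)).max())
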

Tensor Parallelism (TP) +
Slice individual layers across GPUs

Weight matrices are split column-wise or row-wise across GPUs. Each GPU holds a fraction of every layer and computes its slice in parallel. Requires AllReduce after every layer — needs high-bandwidth interconnect (NVLink). Best within a single node.
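
A numpy sketch of the two splitting schemes used in Megatron-style TP, with the "GPUs" simulated as plain array shards: column-parallel outputs are concatenated, row-parallel partial sums are added (that addition is the AllReduce after the layer).

import numpy as np

X = np.random.randn(8, 512)                     # activations: [batch, hidden]
W1 = np.random.randn(512, 2048)                 # up-projection
W2 = np.random.randn(2048, 512)                 # down-projection
tp = 4                                          # simulated tensor-parallel degree

# Column-parallel: each "GPU" holds a vertical slice of W1 and produces a slice of the output.
W1_shards = np.split(W1, tp, axis=1)
Y_shards = [X @ w for w in W1_shards]           # no communication needed yet
Y = np.concatenate(Y_shards, axis=1)

# Row-parallel: each "GPU" holds a horizontal slice of W2 and the matching slice of Y,
# producing a partial sum. Adding the partials is the AllReduce step.
W2_shards = np.split(W2, tp, axis=0)
Z = sum(y @ w for y, w in zip(np.split(Y, tp, axis=1), W2_shards))

assert np.allclose(Z, (X @ W1) @ W2)            # identical to the unsharded computation
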

Pipeline Parallelism (PP) +
Divide layers into sequential stages

A 32-layer model with PP=4 assigns layers 0-7 to GPU 0, 8-15 to GPU 1, etc. Communication is only between adjacent stages (point-to-point), much less frequent than TP. Pipeline bubbles (idle stages) are mitigated with micro-batching. Best for scaling across nodes.

Expert Parallelism (EP) +
For MoE models: route experts to GPUs

Specialised for Mixture of Experts models (DeepSeek-V3: 256 experts, Mixtral: 8). Different experts placed on different GPUs. Requires All-to-All communication to route tokens to correct expert GPU. Since only a fraction of experts activate per token (e.g., 2/64), each GPU does less work while the total model can be massive.

Putting It All Together

Serving Frameworks

These frameworks implement the full pipeline above. Each makes different trade-offs.

vLLM +
Most widely adopted. PagedAttention + continuous batching.

Key innovations: PagedAttention, continuous batching, OpenAI-compatible API. V1 architecture (2025) features separate engine core process for scheduler + KV cache management. Largest community, best model support (~15-20 new models/week).

120-160
req/s
50-80ms
TTFT
SGLang +
RadixAttention for automatic prefix sharing.

RadixAttention uses a radix tree to store KV cache prefixes, enabling automatic sharing across requests with partial prefix overlap. Up to 6.4x higher throughput and 3.7x lower latency than vLLM on structured workloads. Best for agents, tool chains, and RAG systems.

TensorRT-LLM +
NVIDIA's optimised library. Fastest at low concurrency.

Graph-level optimisations, kernel fusion, in-flight batching. Fastest TTFT at low concurrency (35-50ms), but can degrade under high load. Tightly coupled to NVIDIA hardware. Best for ultra-low latency with well-supported models.

NVIDIA Dynamo +
Next-gen distributed framework (GTC 2025).

Built-in prefill-decode disaggregation, dynamic GPU scheduling, LLM-aware request routing. Up to 30x more requests (DeepSeek-R1 on Blackwell), 2x+ throughput (Llama 70B on Hopper). Dynamo Planner: SLO-driven automation solving the rate-matching challenge between prefill and decode tiers.

End-to-End Summary

The Complete Journey

Click any step label to view details
User Request
  |
  v
── Phase A: Request Preparation ──────────────────
  |
[01 Request Routing]
  |-- KV cache aware routing
  |-- Prefill/decode disaggregation
  v
[02 Preprocessing]
  |-- Input validation, rate limiting
  |-- Prompt template + RAG retrieval
  v
[03 Tokenization]
  |-- BPE / SentencePiece -> token IDs
  |-- tiktoken (3-6x faster)
  v
[04 Embedding + Position]
  |-- Token ID -> vector (table lookup)
  |-- + RoPE positional encoding
  |
  v
── Phase B: GPU Computation ───────────────────────
  |
[05 Scheduler Queue]
  |-- Continuous batching: join next iteration
  v
[06 PREFILL] (compute-bound, parallel)
  |-- Check prefix cache -> skip if hit
  |-- All tokens through all layers
  |-- Store K,V in KV cache (PagedAttention)
  |-- FlashAttention for IO efficiency
  v
[07-08 KV Cache + Attention]
  |-- GQA (standard), MLA (DeepSeek)
  |-- PagedAttention: <4% memory waste
  v
[09 DECODE LOOP] (memory-bound, sequential)
  |-- Single token through all layers
  |-- Q attends to cached K,V
  |-- Speculative decoding: 2-3x speedup
  |
  v
── Phase C: Output & Delivery ────────────────────
  |
[10 Sampling]
  |-- Penalties -> Temperature -> Min-P -> Softmax -> Sample
  v
[11 Detokenize + Stream]
  |-- Incremental detokenization
  |-- SSE streaming -> content filter -> [DONE]
  v
User receives streamed response
─────────────────────────────────────────────────────
Cross-cutting: Quantization (FP8/INT4) + Parallelism (TP/PP/EP)
Applied across all phases to reduce memory & distribute compute