From raw data to aligned model — every stage of building a large language model.
DCLM showed keeping only the top 10% of data by quality beats training on everything. Data curation is the single highest-leverage investment in model quality.
A chef sourcing ingredients — the finest technique can’t save a dish made from spoiled food. Training data is the raw ingredient of intelligence.
Modern training corpora start from web crawls. CommonCrawl provides petabytes of raw HTML. Curated derivatives include FineWeb (15T tokens), DCLM (240T raw → 3.8T curated tokens), and RedPajama. Additional sources: books, scientific papers (~5%), code from GitHub/StackOverflow (~15%), and Wikipedia.
Heuristic filters: remove documents by length (too short = low content), perplexity (too high = garbled text), PII patterns, and URL blocklists. Classifier-based filtering: train a fastText model on high-quality vs. low-quality examples, then score and threshold every document. DCLM used a fastText classifier trained on OpenHermes 2.5 and ELI5 data to keep only top-scoring text.
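A minimal sketch of this two-stage pipeline. The thresholds, the boilerplate check, and the score function are illustrative stand-ins; a real pipeline would score with a trained fastText classifier rather than a toy function.

```python
# Sketch: heuristic pre-filtering, then keep only the top-scoring fraction.
# Thresholds and the score function are illustrative assumptions.

def passes_heuristics(doc: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Cheap filters applied before any model-based scoring."""
    n_words = len(doc.split())
    if not (min_words <= n_words <= max_words):
        return False          # too short (low content) or absurdly long
    if "lorem ipsum" in doc.lower():
        return False          # stand-in for a boilerplate/blocklist check
    return True

def keep_top_fraction(docs, score_fn, fraction=0.10):
    """Score every surviving document and keep the top fraction, DCLM-style."""
    survivors = [d for d in docs if passes_heuristics(d)]
    ranked = sorted(survivors, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

The two stages are deliberately ordered: heuristics are nearly free, so they run first and shrink the set the (more expensive) classifier must score.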
Duplicate content wastes compute and can cause memorization. MinHash LSH (Locality-Sensitive Hashing) efficiently finds near-duplicate documents by approximating Jaccard similarity; SimHash is a faster, lower-memory near-duplicate alternative. Exact duplicates are caught with document-level hashing, and exact substring dedup (suffix arrays) catches boilerplate repeated across many pages.
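A toy MinHash sketch of the Jaccard approximation, assuming word 3-gram shingles and md5-derived hash functions. A real pipeline would add LSH banding over these signatures so near-duplicates can be found without all-pairs comparison.

```python
# Toy MinHash: the fraction of hash functions whose minimum agrees between two
# shingle sets approximates their Jaccard similarity.
import hashlib

def shingles(text: str, n: int = 3) -> set:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items: set, num_hashes: int = 64) -> list:
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    # P(min hashes agree) equals the true Jaccard similarity of the two sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical documents produce identical signatures (estimate 1.0); documents sharing no shingles agree on essentially no minima (estimate 0.0).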
The final training mixture balances domains: typically ~70% web text, 15% code, 5% scientific, 5% books, 5% other (conversation, math, encyclopedic). Optimal mixing ratios are determined empirically by training small proxy models on different mixes and measuring downstream task performance.
Vocabulary size is a fundamental trade-off: larger vocabularies compress text better (fewer tokens per sentence) but increase embedding table size and can fragment rare words.
Building a dictionary before learning a language. The tokenizer decides which “words” the model will think in — too few and it stutters, too many and it wastes memory on rare terms.
Byte-Pair Encoding starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens while rare words decompose into subword pieces. GPT-4 uses tiktoken (100K vocabulary); Llama 3 uses a tiktoken-based BPE tokenizer (128K vocabulary, up from Llama 2's 32K SentencePiece vocabulary).
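The merge loop can be sketched in a few lines. The toy corpus and word frequencies below are illustrative; real trainers operate on bytes and far larger counts.

```python
# Toy BPE trainer: start from characters, repeatedly merge the most frequent
# adjacent pair into a new symbol.
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    # Sort first so ties break deterministically.
    return max(sorted(counts), key=counts.__getitem__)

def merge_pair(words: dict, pair: tuple) -> dict:
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs: dict, num_merges: int) -> list:
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges
```

On a corpus dominated by "low", "lower", "lowest", the first merges fuse `l+o` and then `lo+w`, so the common stem "low" becomes a single token.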
The Unigram approach works in reverse: start with a very large vocabulary, then iteratively prune tokens that contribute least to the training corpus likelihood. SentencePiece supports both BPE and Unigram. Unigram tends to produce more linguistically meaningful subwords.
| Model | Vocab Size | Method |
|---|---|---|
| GPT-4 | 100,256 | BPE (tiktoken) |
| Llama 3 | 128,256 | BPE (tiktoken-based) |
| Llama 2 | 32,000 | BPE (SentencePiece) |
| Claude 3 | ~100,000 | BPE |
| Gemini | 256,000 | SentencePiece |
Modern LLM architecture has converged on a small set of proven components. Innovation now happens at the edges: MoE for efficiency, MLA for KV cache compression, and deeper attention alternatives.
Multi-Head Attention (MHA): each head has its own Q, K, V projections. Full expressivity but large KV cache (proportional to number of heads).
Multi-Query Attention (MQA): all heads share a single K and V. Dramatically reduces KV cache but can hurt quality.
Grouped-Query Attention (GQA): compromise — heads are grouped, each group shares K/V. Llama 3 uses 8 KV heads for 64 query heads (8:1 ratio). Near-MHA quality with near-MQA efficiency.
MLA (DeepSeek V2/V3) compresses keys and values into a low-rank latent space before caching. Instead of storing full K/V tensors, it stores a compressed representation and reconstructs K/V on the fly during attention.
Result: 28x reduction in KV cache size compared to standard MHA, with minimal quality loss. This enables much larger batch sizes during inference.
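The cache-size arithmetic behind these variants, with illustrative (roughly Llama-3-70B-shaped) dimensions, and MLA approximated as caching one latent vector of width `d_latent` per layer. The specific numbers are assumptions for the sketch, not any model's published configuration.

```python
# Back-of-envelope KV-cache bytes per token for MHA / GQA / MQA / MLA.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    # 2 tensors (K and V) per layer, one slice per KV head, bf16 elements.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el

def mla_bytes_per_token(n_layers, d_latent, bytes_per_el=2):
    # MLA caches one compressed latent per layer instead of full K/V.
    return n_layers * d_latent * bytes_per_el

n_layers, n_heads, head_dim = 80, 64, 128
mha = kv_bytes_per_token(n_layers, n_heads, head_dim)   # every head cached
gqa = kv_bytes_per_token(n_layers, 8, head_dim)         # 8 shared KV heads
mqa = kv_bytes_per_token(n_layers, 1, head_dim)         # 1 shared KV head
print(mha // gqa)   # 8: GQA with a 64:8 head ratio shrinks the cache 8x
```

The GQA ratio drops out directly: cache size scales with the number of KV heads, so 64 query heads sharing 8 KV heads cuts the cache by exactly 8x.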
SwiGLU replaces the standard 2-matrix FFN (up-project → ReLU → down-project) with a 3-matrix gated design: two parallel up-projections, one gated by SiLU activation, then multiplied element-wise before the down-projection. Intermediate dimension is typically (8/3) × d_model to keep parameter count comparable.
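A minimal sketch of the gated design in plain Python, with vectors as lists. The weights in the usage below are tiny identity matrices purely for illustration; a real FFN uses learned matrices with the intermediate dimension described above.

```python
# SwiGLU FFN sketch: two parallel up-projections, one passed through SiLU,
# multiplied element-wise, then projected back down.
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))   # SiLU / swish activation

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # gated branch
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(W_down, hidden)                 # project back to d_model
```

The element-wise product is the "gate": the SiLU branch can suppress or pass each channel of the linear branch, which is what the extra third matrix buys over a plain two-matrix FFN.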
QK-Norm: normalize query and key vectors before the dot product to prevent attention logit explosion at scale. Logit soft-capping: squash pre-softmax logits toward a maximum value with a tanh-based cap (e.g., 30.0 in Gemma 2). Z-loss: auxiliary loss that penalizes large logits, reducing training instability.
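A sketch of tanh-based soft-capping; the cap value of 30.0 follows the Gemma 2 example above. Unlike a hard clip, the tanh keeps the function smooth and differentiable everywhere.

```python
# Soft-cap a pre-softmax logit into the open interval (-cap, cap).
import math

def soft_cap(logit: float, cap: float = 30.0) -> float:
    return cap * math.tanh(logit / cap)
```

Near zero the function is approximately the identity (tanh(x) ≈ x for small x), so well-behaved logits are barely touched; only outliers get squashed.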
MoE replaces each dense FFN layer with N expert FFN layers plus a router. Only a subset of experts are activated per token. DeepSeek V3: 671B total parameters, 37B active per token (256 experts, top-8 routing). Auxiliary-loss-free load balancing via bias terms prevents expert collapse.
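A toy top-k router. The experts here are trivial scalar functions purely for illustration; in a real MoE the router logits come from a learned linear layer and each expert is a full FFN.

```python
# Top-k MoE routing sketch: softmax over expert logits, keep the top-k,
# renormalize their gate weights, and mix the selected experts' outputs.
import math

def top_k_route(logits, k):
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]                     # softmax
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]             # renormalized gates

def moe_forward(x, experts, logits, k=2):
    # Only the k routed experts run: this sparsity is the source of MoE's
    # compute savings (37B active of 671B total in DeepSeek V3).
    return sum(weight * experts[i](x) for i, weight in top_k_route(logits, k))
```

The load-balancing problem mentioned above arises because nothing in this forward pass stops the router from sending every token to the same few experts; the bias-term scheme nudges routing decisions to keep expert utilization even.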
RoPE (Rotary Position Embeddings) encodes position by rotating query/key vectors in 2D subspaces. Relative by construction — attention depends on distance between tokens, not absolute position. Supports extrapolation to longer sequences than seen during training (with techniques like NTK-aware scaling, YaRN).
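The relative-position property can be checked directly in the 2D case; the vectors and base angle below are arbitrary choices for the sketch.

```python
# 2D RoPE sketch: rotate q and k by position-dependent angles and verify the
# attention score depends only on the distance between positions.
import math

def rotate(vec, pos, theta=0.1):
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def score(q, k, pos_q, pos_k):
    qx, qy = rotate(q, pos_q)
    kx, ky = rotate(k, pos_k)
    return qx * kx + qy * ky

q, k = (1.0, 0.5), (0.3, -0.7)
# Same relative distance (5) at two different absolute offsets:
assert math.isclose(score(q, k, 10, 5), score(q, k, 107, 102))
```

This works because rotation matrices compose: the dot product of `R(mθ)q` and `R(nθ)k` equals `q · R((n-m)θ)k`, so only the offset `n-m` survives. A full implementation applies this rotation independently in many 2D subspaces with different θ per subspace.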
Muon (2025) achieves 2x the efficiency of AdamW and is now in PyTorch core. Meanwhile, FP8 training pioneered by DeepSeek V3 delivers 20-50% throughput gains with minimal quality loss.
A hiker descending a foggy mountain. The optimizer decides step direction and size; the learning rate schedule decides when to take big strides vs. careful steps; precision format determines how detailed their map is.
| Optimizer | Memory/Param | Key Advantage |
|---|---|---|
| AdamW | 12 bytes | Incumbent, well-understood, reliable |
| Muon | 8 bytes | 2x efficiency, now in PyTorch core (2025) |
| SOAP | ~12 bytes | >40% fewer iterations to converge |
| Lion | 4 bytes | 50% memory savings, sign-based updates |
Cosine decay: the traditional choice — warm up linearly, then decay following a cosine curve to near-zero. Problem: schedule length must be set before training begins.
WSD (Warmup-Stable-Decay): decouple the schedule from total training steps. Warm up, hold a stable rate for most of training, then decay sharply at the end. Enables checkpoints at any point during the stable phase to be continued or branched.
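A sketch of the WSD schedule. Phase lengths and the peak rate are illustrative, and the decay here is linear (implementations vary); the key structural point is that `decay_start` can be chosen late, after training has already begun.

```python
# Warmup-Stable-Decay learning-rate schedule sketch.
def wsd_lr(step, peak_lr=3e-4, warmup_steps=1000, decay_start=10_000,
           decay_steps=1000, min_lr=0.0):
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                        # stable plateau
        return peak_lr
    t = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * t       # sharp decay at the end
```

Any checkpoint taken during the plateau can be branched: continue at the stable rate for longer training, or kick off the short decay to produce a finished model.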
BF16 (bfloat16): the standard for modern training. Same dynamic range as FP32 with half the bits. Supported natively on A100+.
FP8: pioneered at scale by DeepSeek V3. 20-50% throughput gain over BF16 with careful scaling. Uses per-tensor or per-channel quantization with high-precision master weights.
FP4: emerging on NVIDIA Blackwell (B200). Still experimental for training; primarily inference-focused.
Simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. Llama 3 used an effective batch size of ~16M tokens, achieved through gradient accumulation across thousands of GPUs.
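A sketch on a one-parameter model showing why accumulation works: averaging the mean gradients of equal-sized micro-batches reproduces the gradient of the full batch, so one deferred optimizer step matches a single large-batch step.

```python
# Gradient accumulation sketch for y = w*x with squared error.
def grad(w, batch):
    # d/dw of mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.01):
    acc = 0.0
    for mb in micro_batches:                 # one forward/backward per micro-batch
        acc += grad(w, mb) / len(micro_batches)
    return w - lr * acc                      # single optimizer step at the end

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
big = accumulated_step(1.0, [data])                      # one batch of 4
small = accumulated_step(1.0, [data[:2], data[2:]])      # two micro-batches of 2
```

The equivalence only holds exactly when micro-batches are equal-sized and the loss is a mean; with unequal sizes the accumulated gradient must be weighted accordingly.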
Llama 3 405B trained on 16,384 H100 GPUs using 4D parallelism (TP=8, CP=16, PP=16, DP), achieving 400 TFLOPS per GPU (~40% MFU). Every parallelism strategy trades off communication overhead for memory savings.
Building a skyscraper with 16,000 workers. Data parallelism gives everyone the same blueprint but different bricks. Tensor parallelism splits each floor across teams. Pipeline parallelism assigns different floors to different crews.
DDP (Distributed Data Parallel): each GPU holds a full model copy, processes different data, synchronizes gradients via all-reduce. Simple but memory-heavy.
ZeRO (Zero Redundancy Optimizer) partitions optimizer states (Stage 1: 4x savings), gradients (Stage 2: 8x), and parameters (Stage 3: linear scaling) across GPUs. Only gathers what’s needed for each operation.
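The stage-by-stage savings can be sketched with the standard mixed-precision accounting: 2 bytes of bf16 parameters, 2 bytes of gradients, and 12 bytes of optimizer state (fp32 master weights, momentum, variance) per parameter.

```python
# Per-GPU memory under the ZeRO stages (standard mixed-precision accounting).
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        o /= n_gpus          # Stage 1: shard optimizer states
    if stage >= 2:
        g /= n_gpus          # Stage 2: also shard gradients
    if stage >= 3:
        p /= n_gpus          # Stage 3: also shard parameters
    return p + g + o
```

Stage 0 (plain DDP) costs 16 bytes per parameter on every GPU; Stage 1 drops the dominant 12-byte optimizer term to 12/N, which is where the headline 4x figure comes from at large N.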
FSDP2 (Fully Sharded Data Parallelism): PyTorch-native ZeRO Stage 3. The default for most training runs that don’t need tensor parallelism.
Tensor parallelism splits individual weight matrices across GPUs within the same node (connected by NVLink at 900 GB/s on H100). Typically TP=8 (one per GPU in a node). Each GPU computes a slice of every layer, then exchanges results via all-reduce.
Communication cost is proportional to hidden dimension and number of layers. Only practical within a node — inter-node bandwidth (InfiniBand at 400 Gb/s) is too slow for per-layer synchronization.
Pipeline parallelism assigns different transformer layers to different nodes. Micro-batches flow through the pipeline like an assembly line. 1F1B scheduling (one forward, one backward) minimizes the “bubble” where stages idle.
Llama 3 used PP=16, meaning 16 pipeline stages across 16 nodes. Each stage holds ~8 layers of the 126-layer model.
Llama 3 405B on 16,384 H100s: TP=8 (within node), CP=16 (context/sequence parallelism for 128K context), PP=16 (pipeline stages), DP (data parallelism across the remaining dimension). Each model replica spans 8 × 16 × 16 = 2,048 GPUs, leaving DP=8 replicas across the full cluster.
400 TFLOPS per GPU (~40% MFU). Expert parallelism for MoE adds a 5th dimension in models like DeepSeek V3, where AllToAll routing sends tokens to the correct expert across nodes.
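The layout arithmetic as a sketch: GPUs per model replica is the product of the model-parallel degrees, and data parallelism fills whatever remains of the cluster.

```python
# GPUs per replica and data-parallel degree for a 3D/4D parallel layout.
def parallel_layout(world_size, tp, cp, pp):
    gpus_per_replica = tp * cp * pp
    assert world_size % gpus_per_replica == 0
    return gpus_per_replica, world_size // gpus_per_replica

per_replica, dp = parallel_layout(16_384, tp=8, cp=16, pp=16)
print(per_replica, dp)   # 2048 GPUs per replica, DP=8 replicas
```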
| Framework | Org | Strength |
|---|---|---|
| Megatron-LM | NVIDIA | TP + PP + expert parallelism, highly optimized CUDA kernels |
| DeepSpeed | Microsoft | ZeRO optimizer, flexible parallelism composition |
| TorchTitan | Meta | Native PyTorch 4D parallelism, used for Llama 3 |
| JAX/XLA | Google | Functional transforms (pmap/pjit), TPU-native |
At scale, failures are inevitable (~1 every 3 hours for Llama 3). The key differentiator is recovery speed: async checkpointing (ByteCheckpoint: 529x speedup) and auto-repair systems (MegaScale: >90% automated recovery).
Training loss: the primary signal. Should decrease smoothly; spikes indicate instability. Gradient norms: sudden increases warn of divergence. MFU (Model FLOPS Utilization): what fraction of theoretical GPU compute is actually used. Llama 3 achieved ~40%; >50% is exceptional.
Validation loss: evaluated periodically on held-out data to detect overfitting or data quality issues.
Residual amplification: deep residual connections can amplify small perturbations. Gradient intensification: sudden gradient magnitude jumps, often triggered by unusual data batches. Attention logit explosion: unbounded dot products in attention — mitigated by QK-norm and logit capping.
Synchronous checkpointing: pause training, write full model state to storage. Simple but adds 12-43% overhead at scale. Async checkpointing (ByteCheckpoint): snapshot to CPU/NVMe while training continues — 529x speedup over synchronous. FSDP2 sharded checkpoints: each GPU writes only its shard, parallel I/O.
MegaScale (ByteDance): automated diagnosis and repair for >90% of failures. Detects faulty GPUs, remaps workloads, and resumes from checkpoint without human intervention. At 16K+ GPU scale, manual recovery is impractical.
SFT bridges the gap between a base model (which just predicts next tokens) and an assistant (which follows instructions). Quality matters more than quantity — 10K excellent examples can outperform 1M mediocre ones.
An apprenticeship after general education. The model already “knows” the language; SFT teaches it the manners and format of a helpful assistant.
Train on 10K-100K curated examples of (instruction, ideal response) pairs. The loss function only computes on the response tokens (masking the instruction). Typically 1-3 epochs to avoid overfitting.
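Loss masking in miniature: cross-entropy is averaged only over response tokens. The token log-probabilities and mask below are illustrative stand-ins for real model output.

```python
# SFT loss with instruction tokens masked out of the objective.
def masked_nll(token_logprobs, response_mask):
    assert len(token_logprobs) == len(response_mask)
    total = sum(-lp for lp, m in zip(token_logprobs, response_mask) if m)
    return total / sum(response_mask)

# Instruction tokens (mask 0) contribute nothing; response tokens (mask 1) do.
loss = masked_nll([-0.1, -2.3, -0.5, -0.7], [0, 0, 1, 1])
```

Without the mask, the model would spend capacity learning to predict the user's instruction text, which it will never need to generate.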
Examples come from human annotators, distillation from stronger models, or synthetic generation with human filtering.
| Mode | What You Train | Key Dataset Shape |
|---|---|---|
| Supervised text | Single-turn text completions | text (and optional loss_mask) fields |
| Supervised chat | Multi-message assistants | messages[] with role/content |
| Supervised vision | Multimodal chat + tools | Chat messages with text/image content arrays |
Fireworks recommends starting with LoRA adapters for most enterprise fine-tuning, then escalating to heavier strategies only if quality targets are not met.
Loss mask lets you train only on selected spans in supervised text examples. In chat mode, loss role training keeps optimization focused on assistant turns instead of user/tool turns.
For vision and tool-using fine-tunes, Fireworks supports preserving tool-call structure so fine-tuned models remain function-calling compatible in production.
Enterprise deployments increasingly require secure fine-tuning boundaries (customer-controlled buckets and explicit retention controls) before regulated data can enter training pipelines.
Full fine-tuning: update all parameters. Best quality but expensive (~$50K per run for a 7B model on H100s).
LoRA/QLoRA: freeze base weights, train small low-rank adapter matrices. LoRA adds <1% parameters, achieves 90-95% of full fine-tuning quality at $300-$3,000 per run. QLoRA further quantizes the base model to 4-bit during training.
Fireworks defaults for LoRA SFT are deliberately conservative (rank 32, alpha 64, context 16K, ~3 epochs), then tuned from there. A major deployment advantage is multi-adapter serving: up to 100 LoRA adapters can share one base model deployment, dramatically improving utilization for long-tail custom models.
GRPO with verifiable rewards (DeepSeek R1) eliminated both the learned reward model and the value network, jumping AIME 2024 scores from 15.6% to 71.0%. The field is rapidly moving away from the complexity of traditional RLHF.
RLHF (Reinforcement Learning from Human Feedback) has 3 stages: (1) SFT the base model, (2) train a reward model on human preference comparisons, (3) optimize the policy with PPO against the reward model.
Requires 4 models in memory simultaneously (policy, reference, reward, value). Complex, expensive, and sensitive to hyperparameters. Still used by OpenAI and Anthropic but increasingly supplemented by simpler methods.
DPO (Direct Preference Optimization) reparameterizes the RLHF objective to directly optimize on preference pairs without training a separate reward model. Only needs 2 models (policy + reference). Simpler pipeline, more stable training.
Limitation: still requires pairwise comparisons (chosen/rejected responses), which are expensive to collect.
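A sketch of the DPO loss on a single preference pair. The four sequence log-probabilities (chosen/rejected under the policy and the frozen reference) are illustrative values.

```python
# DPO loss for one (chosen, rejected) pair: -log sigmoid of the beta-scaled
# margin between the policy's and reference's log-probability ratios.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does,
# so the loss falls below log(2) (the loss at zero margin):
loss = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-7.0)
```

The reference model anchors the objective: the loss rewards the policy for shifting probability toward the chosen response *relative to the reference*, which is what keeps DPO from simply collapsing onto the preference data.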
Fireworks DPO expects explicit triplets: prompt, preferred output, and non-preferred output. This makes preference data portable across experiments and easier to audit.
When workflows depend on function calling, datasets can include tool definitions and tool-call traces so preference optimization teaches both quality and tool-use behavior.
GRPO (DeepSeek R1): no reward model, no value network. Generate multiple responses per prompt, use verifiable rewards (math correctness, code execution, format compliance) to score them, then optimize the policy on group-relative advantages: each response's reward normalized by the group's mean and standard deviation.
Result on AIME 2024: 15.6% → 71.0%. Dramatically simpler and more scalable than RLHF for domains with verifiable outcomes.
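The group-relative advantage computation in miniature; the binary rewards below stand in for a verifiable checker (e.g., 1.0 if the math answer is correct).

```python
# GRPO advantage sketch: normalize each sampled response's reward against the
# group's mean and standard deviation. No value network is needed.
import math

def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0              # guard against zero spread
    return [(r - mean) / std for r in rewards]

adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # two correct, two incorrect
```

Correct responses get positive advantage and are reinforced; incorrect ones get negative advantage and are suppressed, with the group itself serving as the baseline a value network would otherwise provide.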
Constitutional AI (Anthropic): replace human preference annotators with AI judges that evaluate responses against a written constitution. Enables RLAIF (RL from AI Feedback) at much lower cost.
KTO (Kahneman-Tversky Optimization): uses only binary labels (good/bad) instead of pairwise comparisons. Easier to collect feedback — humans just rate individual responses as thumbs up or down.
Reinforcement Fine-Tuning (RFT) bridges the gap between general-purpose alignment and domain expertise. By using verifiable domain-specific rewards, RFT produces specialist models that reason through problems rather than pattern-match from training data. OpenAI, Google, and others now offer RFT as a product for enterprise customers.
SFT teaches a medical student the textbook answers. RFT gives them a residency — repeated practice with real cases and expert feedback on their reasoning process, not just their final answer.
RFT applies reinforcement learning after SFT, using domain-specific reward signals rather than general human preferences. The model generates multiple reasoning chains for each problem, and a reward function (often automated) scores the quality of the reasoning process, not just the final answer.
Key difference from standard RLHF: rewards are verifiable and domain-grounded — math correctness, code execution results, legal citation accuracy, medical diagnosis criteria — rather than subjective human preference.
Fireworks RFT formalizes the loop into four assets: (1) an evaluator model or rubric, (2) a runtime environment for verification, (3) parameter tuning (rollouts/learning rate/steps), and (4) tracing for step-level debugging.
This turns RFT from a one-off experiment into a repeatable production workflow: train → evaluate → tune → retrain, with cost estimators and telemetry at each stage.
Domain experts provide grading rubrics rather than example answers. For a legal reasoning task, the rubric might score: correct statute identification (+1), proper precedent citation (+1), logical argument structure (+1), correct conclusion (+2). The model learns to maximize the rubric score through repeated practice.
OpenAI’s RFT product uses this approach: customers provide 50-500 expert-graded examples with multi-dimensional scoring criteria. The model trains on thousands of self-generated attempts, evaluated against the rubric.
For domains with objectively verifiable answers, rewards can be fully automated: run the generated code, check the math proof, verify the chemical formula. DeepSeek R1’s GRPO is a form of verifiable-reward RFT.
Scales massively because no human annotation is needed. The model can train on millions of self-generated attempts. This is how OpenAI o1/o3 and DeepSeek R1 achieved breakthroughs on math and coding benchmarks.
Outcome reward models (ORMs): score only the final answer. Simple but provides sparse signal — the model doesn’t know which reasoning step went wrong.
Process reward models (PRMs): score each intermediate reasoning step. Richer signal, better at teaching correct reasoning chains. OpenAI’s “Let’s Verify Step by Step” (2023) showed PRMs significantly outperform ORMs on math, reducing hallucinated reasoning.
The tradeoff: PRMs require step-level annotations, which are much more expensive to collect than outcome labels.
OpenAI RFT (2024-2025): enterprise product that fine-tunes o1/o3 models with customer-provided grading rubrics. Customers report 10-30% improvement on domain-specific tasks vs. prompting alone. Requires 50-500 graded examples.
Google Gemini RFT: similar offering through Vertex AI, focusing on code generation and structured reasoning tasks.
Fireworks RFT: integrates evaluator models, environment hooks, and secure fine-tuning options (including customer-controlled storage boundaries) for regulated use cases.
The business model: RFT is premium fine-tuning — higher margin than standard SFT because it produces genuinely differentiated models that reason better in the customer’s domain.
| Method | What It Teaches | Data Needed | Best For |
|---|---|---|---|
| SFT | Format & knowledge | 10K-100K examples | Style, format, basic skills |
| RLHF/DPO | Human preferences | Preference pairs | General helpfulness, safety |
| RFT | Domain reasoning | 50-500 graded + self-play | Specialist reasoning tasks |
Static benchmarks saturate and leak into training data. The field is shifting to dynamic evaluation: LiveBench refreshes monthly, Chatbot Arena uses live ELO rankings from blind human comparisons.
MMLU (57 subjects, multiple choice): essentially saturated at >90% for frontier models. MMLU-Pro: 10-option multiple choice with harder questions; frontier-model accuracy drops 16-33 points relative to MMLU. HellaSwag: commonsense reasoning about physical situations.
Code: HumanEval (function completion), LiveCodeBench (monthly-refreshed problems), SWE-bench (real GitHub issues — requires multi-file edits).
Math: GSM8K (grade school, largely saturated), MATH (competition-level), AIME (American Invitational Mathematics Examination, still challenging for frontier models).
Min-K% Prob: measures whether a model assigns suspiciously high probability to benchmark answers, suggesting memorization. CDD (Canonical Data Detection): checks if benchmark examples appear verbatim in training data. Perplexity-based: low perplexity on benchmark prompts suggests data leakage.
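Min-K% Prob in miniature, with illustrative token log-probabilities: average the log-probs of the k% least likely tokens, and flag texts where even the "hard" tokens are suspiciously easy for the model.

```python
# Min-K% Prob sketch: a score near 0 on the least likely tokens suggests the
# model has memorized the text (e.g., a leaked benchmark item).
def min_k_percent_prob(token_logprobs, k=0.2):
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]          # the k% least likely tokens
    return sum(lowest) / n

memorized = [-0.01, -0.02, -0.05, -0.01, -0.03]  # uniformly high probability
novel = [-0.01, -3.2, -0.05, -4.1, -0.03]        # genuinely surprising tokens
assert min_k_percent_prob(memorized) > min_k_percent_prob(novel)
```

Focusing on the tail matters: even unseen text contains many easy tokens, so averaging over all tokens would blur the memorization signal that the least likely tokens carry.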
LiveBench: monthly refresh of questions with verifiable answers. Immune to contamination because questions didn’t exist during training. Chatbot Arena (LMSYS): blind pairwise comparisons by real users, producing ELO rankings. Over 1M votes. The most trusted signal of real-world model quality.
For reasoning-capable models, evaluate at multiple runtime policies: low/high reasoning effort, budget-capped thinking, and preserved-history on/off. A model that looks strong in one policy can fail latency or quality targets in another.
This is now part of production release criteria: benchmark quality plus policy-specific inference behavior.