From raw data to aligned model — every stage of building a large language model.
DCLM showed keeping only the top 10% of data by quality beats training on everything. Data curation is the single highest-leverage investment in model quality.
A chef sourcing ingredients — the finest technique can’t save a dish made from spoiled food. Training data is the raw ingredient of intelligence.
Modern training corpora start from web crawls. CommonCrawl provides petabytes of raw HTML. Curated derivatives include FineWeb (15T tokens), DCLM (240T raw → 3.8T curated tokens), and RedPajama. Additional sources: books, scientific papers (~5%), code from GitHub/StackOverflow (~15%), and Wikipedia.
Heuristic filters: remove documents by length (too short = low content), perplexity (too high = garbled text), PII patterns, and URL blocklists. Classifier-based filtering: train a fastText model on high-quality vs. low-quality examples, then score and threshold every document. DCLM used a fastText classifier trained on OpenHermes 2.5 and ELI5 data to keep only top-scoring text.
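A minimal sketch of this two-stage pipeline. The thresholds, the boilerplate check, and the score function are illustrative stand-ins; a real pipeline would score with a trained fastText classifier rather than a toy function.

```python
# Sketch: heuristic pre-filtering, then keep only the top-scoring fraction.
# Thresholds and the score function are illustrative assumptions.

def passes_heuristics(doc: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Cheap filters applied before any model-based scoring."""
    n_words = len(doc.split())
    if not (min_words <= n_words <= max_words):
        return False          # too short (low content) or absurdly long
    if "lorem ipsum" in doc.lower():
        return False          # stand-in for a boilerplate/blocklist check
    return True

def keep_top_fraction(docs, score_fn, fraction=0.10):
    """Score every surviving document and keep the top fraction, DCLM-style."""
    survivors = [d for d in docs if passes_heuristics(d)]
    ranked = sorted(survivors, key=score_fn, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

The two stages are deliberately ordered: heuristics are nearly free, so they run first and shrink the set the (more expensive) classifier must score.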
Duplicate content wastes compute and can cause memorization. MinHash LSH (Locality-Sensitive Hashing) efficiently finds near-duplicate documents by approximating Jaccard similarity; SimHash is a faster, lower-memory near-duplicate alternative. Exact duplicates are caught with document-level hashing, and exact substring dedup (suffix arrays) catches boilerplate repeated across many pages.
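A toy MinHash sketch of the Jaccard approximation, assuming word 3-gram shingles and md5-derived hash functions. A real pipeline would add LSH banding over these signatures so near-duplicates can be found without all-pairs comparison.

```python
# Toy MinHash: the fraction of hash functions whose minimum agrees between two
# shingle sets approximates their Jaccard similarity.
import hashlib

def shingles(text: str, n: int = 3) -> set:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items: set, num_hashes: int = 64) -> list:
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items
        ))
    return sig

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    # P(min hashes agree) equals the true Jaccard similarity of the two sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical documents produce identical signatures (estimate 1.0); documents sharing no shingles agree on essentially no minima (estimate 0.0).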
The final training mixture balances domains: typically ~70% web text, 15% code, 5% scientific, 5% books, 5% other (conversation, math, encyclopedic). Optimal mixing ratios are determined empirically by training small proxy models on different mixes and measuring downstream task performance.
Vocabulary size is a fundamental trade-off: larger vocabularies compress text better (fewer tokens per sentence) but increase embedding table size and can fragment rare words.
Building a dictionary before learning a language. The tokenizer decides which “words” the model will think in — too few and it stutters, too many and it wastes memory on rare terms.
Byte-Pair Encoding starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. After thousands of merges, common words become single tokens while rare words decompose into subword pieces. GPT-4 uses tiktoken (100K vocabulary); Llama 3 uses a tiktoken-based BPE tokenizer (128K vocabulary, up from Llama 2's 32K SentencePiece vocabulary).
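The merge loop can be sketched in a few lines. The toy corpus and word frequencies below are illustrative; real trainers operate on bytes and far larger counts.

```python
# Toy BPE trainer: start from characters, repeatedly merge the most frequent
# adjacent pair into a new symbol.
from collections import Counter

def most_frequent_pair(words: dict) -> tuple:
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    # Sort first so ties break deterministically.
    return max(sorted(counts), key=counts.__getitem__)

def merge_pair(words: dict, pair: tuple) -> dict:
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(word_freqs: dict, num_merges: int) -> list:
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges
```

On a corpus dominated by "low", "lower", "lowest", the first merges fuse `l+o` and then `lo+w`, so the common stem "low" becomes a single token.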
The Unigram approach works in reverse: start with a very large vocabulary, then iteratively prune tokens that contribute least to the training corpus likelihood. SentencePiece supports both BPE and Unigram. Unigram tends to produce more linguistically meaningful subwords.
| Model | Vocab Size | Method |
|---|---|---|
| GPT-4 | 100,256 | BPE (tiktoken) |
| Llama 3 | 128,256 | BPE (tiktoken-based) |
| Llama 2 | 32,000 | BPE (SentencePiece) |
| Claude 3 | ~100,000 | BPE |
| Gemini | 256,000 | SentencePiece |
Modern LLM architecture has converged on a small set of proven components. Innovation now happens at the edges: MoE for efficiency, MLA for KV cache compression, and deeper attention alternatives.
Multi-Head Attention (MHA): each head has its own Q, K, V projections. Full expressivity but large KV cache (proportional to number of heads).
Multi-Query Attention (MQA): all heads share a single K and V. Dramatically reduces KV cache but can hurt quality.
Grouped-Query Attention (GQA): compromise — heads are grouped, each group shares K/V. Llama 3 uses 8 KV heads for 64 query heads (8:1 ratio). Near-MHA quality with near-MQA efficiency.
MLA (DeepSeek V2/V3) compresses keys and values into a low-rank latent space before caching. Instead of storing full K/V tensors, it stores a compressed representation and reconstructs K/V on the fly during attention.
Result: 28x reduction in KV cache size compared to standard MHA, with minimal quality loss. This enables much larger batch sizes during inference.
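The cache-size arithmetic behind these variants, with illustrative (roughly Llama-3-70B-shaped) dimensions, and MLA approximated as caching one latent vector of width `d_latent` per layer. The specific numbers are assumptions for the sketch, not any model's published configuration.

```python
# Back-of-envelope KV-cache bytes per token for MHA / GQA / MQA / MLA.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    # 2 tensors (K and V) per layer, one slice per KV head, bf16 elements.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el

def mla_bytes_per_token(n_layers, d_latent, bytes_per_el=2):
    # MLA caches one compressed latent per layer instead of full K/V.
    return n_layers * d_latent * bytes_per_el

n_layers, n_heads, head_dim = 80, 64, 128
mha = kv_bytes_per_token(n_layers, n_heads, head_dim)   # every head cached
gqa = kv_bytes_per_token(n_layers, 8, head_dim)         # 8 shared KV heads
mqa = kv_bytes_per_token(n_layers, 1, head_dim)         # 1 shared KV head
print(mha // gqa)   # 8: GQA with a 64:8 head ratio shrinks the cache 8x
```

The GQA ratio drops out directly: cache size scales with the number of KV heads, so 64 query heads sharing 8 KV heads cuts the cache by exactly 8x.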
SwiGLU replaces the standard 2-matrix FFN (up-project → ReLU → down-project) with a 3-matrix gated design: two parallel up-projections, one gated by SiLU activation, then multiplied element-wise before the down-projection. Intermediate dimension is typically (8/3) × d_model to keep parameter count comparable.
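A minimal sketch of the gated design in plain Python, with vectors as lists. The weights in the usage below are tiny identity matrices purely for illustration; a real FFN uses learned matrices with the intermediate dimension described above.

```python
# SwiGLU FFN sketch: two parallel up-projections, one passed through SiLU,
# multiplied element-wise, then projected back down.
import math

def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))   # SiLU / swish activation

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # gated branch
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(W_down, hidden)                 # project back to d_model
```

The element-wise product is the "gate": the SiLU branch can suppress or pass each channel of the linear branch, which is what the extra third matrix buys over a plain two-matrix FFN.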
QK-Norm: normalize query and key vectors before the dot product to prevent attention logit explosion at scale. Logit soft-capping: squash pre-softmax logits toward a maximum value with a tanh-based cap (e.g., 30.0 in Gemma 2). Z-loss: auxiliary loss that penalizes large logits, reducing training instability.
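A sketch of tanh-based soft-capping; the cap value of 30.0 follows the Gemma 2 example above. Unlike a hard clip, the tanh keeps the function smooth and differentiable everywhere.

```python
# Soft-cap a pre-softmax logit into the open interval (-cap, cap).
import math

def soft_cap(logit: float, cap: float = 30.0) -> float:
    return cap * math.tanh(logit / cap)
```

Near zero the function is approximately the identity (tanh(x) ≈ x for small x), so well-behaved logits are barely touched; only outliers get squashed.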
MoE replaces each dense FFN layer with N expert FFN layers plus a router. Only a subset of experts are activated per token. DeepSeek V3: 671B total parameters, 37B active per token (256 experts, top-8 routing). Auxiliary-loss-free load balancing via bias terms prevents expert collapse.
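A toy top-k router. The experts here are trivial scalar functions purely for illustration; in a real MoE the router logits come from a learned linear layer and each expert is a full FFN.

```python
# Top-k MoE routing sketch: softmax over expert logits, keep the top-k,
# renormalize their gate weights, and mix the selected experts' outputs.
import math

def top_k_route(logits, k):
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]                     # softmax
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]             # renormalized gates

def moe_forward(x, experts, logits, k=2):
    # Only the k routed experts run: this sparsity is the source of MoE's
    # compute savings (37B active of 671B total in DeepSeek V3).
    return sum(weight * experts[i](x) for i, weight in top_k_route(logits, k))
```

The load-balancing problem mentioned above arises because nothing in this forward pass stops the router from sending every token to the same few experts; the bias-term scheme nudges routing decisions to keep expert utilization even.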
RoPE (Rotary Position Embeddings) encodes position by rotating query/key vectors in 2D subspaces. Relative by construction — attention depends on distance between tokens, not absolute position. Supports extrapolation to longer sequences than seen during training (with techniques like NTK-aware scaling, YaRN).
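The relative-position property can be checked directly in the 2D case; the vectors and base angle below are arbitrary choices for the sketch.

```python
# 2D RoPE sketch: rotate q and k by position-dependent angles and verify the
# attention score depends only on the distance between positions.
import math

def rotate(vec, pos, theta=0.1):
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def score(q, k, pos_q, pos_k):
    qx, qy = rotate(q, pos_q)
    kx, ky = rotate(k, pos_k)
    return qx * kx + qy * ky

q, k = (1.0, 0.5), (0.3, -0.7)
# Same relative distance (5) at two different absolute offsets:
assert math.isclose(score(q, k, 10, 5), score(q, k, 107, 102))
```

This works because rotation matrices compose: the dot product of `R(mθ)q` and `R(nθ)k` equals `q · R((n-m)θ)k`, so only the offset `n-m` survives. A full implementation applies this rotation independently in many 2D subspaces with different θ per subspace.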
Muon (2025) achieves 2x the efficiency of AdamW and is now in PyTorch core. Meanwhile, FP8 training pioneered by DeepSeek V3 delivers 20-50% throughput gains with minimal quality loss.
A hiker descending a foggy mountain. The optimizer decides step direction and size; the learning rate schedule decides when to take big strides vs. careful steps; precision format determines how detailed their map is.
| Optimizer | Memory/Param | Key Advantage |
|---|---|---|
| AdamW | 12 bytes | Incumbent, well-understood, reliable |
| Muon | 8 bytes | 2x efficiency, now in PyTorch core (2025) |
| SOAP | ~12 bytes | >40% fewer iterations to converge |
| Lion | 4 bytes | 50% memory savings, sign-based updates |
Cosine decay: the traditional choice — warm up linearly, then decay following a cosine curve to near-zero. Problem: schedule length must be set before training begins.
WSD (Warmup-Stable-Decay): decouple the schedule from total training steps. Warm up, hold a stable rate for most of training, then decay sharply at the end. Enables checkpoints at any point during the stable phase to be continued or branched.
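A sketch of the WSD schedule. Phase lengths and the peak rate are illustrative, and the decay here is linear (implementations vary); the key structural point is that `decay_start` can be chosen late, after training has already begun.

```python
# Warmup-Stable-Decay learning-rate schedule sketch.
def wsd_lr(step, peak_lr=3e-4, warmup_steps=1000, decay_start=10_000,
           decay_steps=1000, min_lr=0.0):
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                        # stable plateau
        return peak_lr
    t = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * t       # sharp decay at the end
```

Any checkpoint taken during the plateau can be branched: continue at the stable rate for longer training, or kick off the short decay to produce a finished model.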
BF16 (bfloat16): the standard for modern training. Same dynamic range as FP32 with half the bits. Supported natively on A100+.
FP8: pioneered at scale by DeepSeek V3. 20-50% throughput gain over BF16 with careful scaling. Uses per-tensor or per-channel quantization with high-precision master weights.
FP4: emerging on NVIDIA Blackwell (B200). Still experimental for training; primarily inference-focused.
Simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. Llama 3 used an effective batch size of ~16M tokens, achieved through gradient accumulation across thousands of GPUs.
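A sketch on a one-parameter model showing why accumulation works: averaging the mean gradients of equal-sized micro-batches reproduces the gradient of the full batch, so one deferred optimizer step matches a single large-batch step.

```python
# Gradient accumulation sketch for y = w*x with squared error.
def grad(w, batch):
    # d/dw of mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.01):
    acc = 0.0
    for mb in micro_batches:                 # one forward/backward per micro-batch
        acc += grad(w, mb) / len(micro_batches)
    return w - lr * acc                      # single optimizer step at the end

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
big = accumulated_step(1.0, [data])                      # one batch of 4
small = accumulated_step(1.0, [data[:2], data[2:]])      # two micro-batches of 2
```

The equivalence only holds exactly when micro-batches are equal-sized and the loss is a mean; with unequal sizes the accumulated gradient must be weighted accordingly.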
Llama 3 405B trained on 16,384 H100 GPUs using 4D parallelism (TP=8, CP=16, PP=16, DP), achieving 400 TFLOPS per GPU (~40% MFU). Every parallelism strategy trades off communication overhead for memory savings.
Building a skyscraper with 16,000 workers. Data parallelism gives everyone the same blueprint but different bricks. Tensor parallelism splits each floor across teams. Pipeline parallelism assigns different floors to different crews.
DDP (Distributed Data Parallel): each GPU holds a full model copy, processes different data, synchronizes gradients via all-reduce. Simple but memory-heavy.
ZeRO (Zero Redundancy Optimizer) partitions optimizer states (Stage 1: 4x savings), gradients (Stage 2: 8x), and parameters (Stage 3: linear scaling) across GPUs. Only gathers what’s needed for each operation.
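The stage-by-stage savings can be sketched with the standard mixed-precision accounting: 2 bytes of bf16 parameters, 2 bytes of gradients, and 12 bytes of optimizer state (fp32 master weights, momentum, variance) per parameter.

```python
# Per-GPU memory under the ZeRO stages (standard mixed-precision accounting).
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        o /= n_gpus          # Stage 1: shard optimizer states
    if stage >= 2:
        g /= n_gpus          # Stage 2: also shard gradients
    if stage >= 3:
        p /= n_gpus          # Stage 3: also shard parameters
    return p + g + o
```

Stage 0 (plain DDP) costs 16 bytes per parameter on every GPU; Stage 1 drops the dominant 12-byte optimizer term to 12/N, which is where the headline 4x figure comes from at large N.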
FSDP2 (Fully Sharded Data Parallelism): PyTorch-native ZeRO Stage 3. The default for most training runs that don’t need tensor parallelism.
Tensor parallelism splits individual weight matrices across GPUs within the same node (connected by NVLink at 900 GB/s on H100). Typically TP=8 (one per GPU in a node). Each GPU computes a slice of every layer, then exchanges results via all-reduce.
Communication cost is proportional to hidden dimension and number of layers. Only practical within a node — inter-node bandwidth (InfiniBand at 400 Gb/s) is too slow for per-layer synchronization.
Pipeline parallelism assigns different transformer layers to different nodes. Micro-batches flow through the pipeline like an assembly line. 1F1B scheduling (one forward, one backward) minimizes the “bubble” where stages idle.
Llama 3 used PP=16, meaning 16 pipeline stages across 16 nodes. Each stage holds ~8 layers of the 126-layer model.
Llama 3 405B on 16,384 H100s: TP=8 (within node), CP=16 (context/sequence parallelism for 128K context), PP=16 (pipeline stages), DP (data parallelism across the remaining dimension). Each model replica spans 8 × 16 × 16 = 2,048 GPUs, leaving DP=8 replicas across the full cluster.
400 TFLOPS per GPU (~40% MFU). Expert parallelism for MoE adds a 5th dimension in models like DeepSeek V3, where AllToAll routing sends tokens to the correct expert across nodes.
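The layout arithmetic as a sketch: GPUs per model replica is the product of the model-parallel degrees, and data parallelism fills whatever remains of the cluster.

```python
# GPUs per replica and data-parallel degree for a 3D/4D parallel layout.
def parallel_layout(world_size, tp, cp, pp):
    gpus_per_replica = tp * cp * pp
    assert world_size % gpus_per_replica == 0
    return gpus_per_replica, world_size // gpus_per_replica

per_replica, dp = parallel_layout(16_384, tp=8, cp=16, pp=16)
print(per_replica, dp)   # 2048 GPUs per replica, DP=8 replicas
```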
| Framework | Org | Strength |
|---|---|---|
| Megatron-LM | NVIDIA | TP + PP + expert parallelism, highly optimized CUDA kernels |
| DeepSpeed | Microsoft | ZeRO optimizer, flexible parallelism composition |
| TorchTitan | Meta | Native PyTorch 4D parallelism, used for Llama 3 |
| JAX/XLA | Google | Functional transforms (pmap/pjit), TPU-native |
At scale, failures are inevitable (~1 every 3 hours for Llama 3). The key differentiator is recovery speed: async checkpointing (ByteCheckpoint: 529x speedup) and auto-repair systems (MegaScale: >90% automated recovery).
Training loss: the primary signal. Should decrease smoothly; spikes indicate instability. Gradient norms: sudden increases warn of divergence. MFU (Model FLOPS Utilization): what fraction of theoretical GPU compute is actually used. Llama 3 achieved ~40%; >50% is exceptional.
Validation loss: evaluated periodically on held-out data to detect overfitting or data quality issues.
Residual amplification: deep residual connections can amplify small perturbations. Gradient intensification: sudden gradient magnitude jumps, often triggered by unusual data batches. Attention logit explosion: unbounded dot products in attention — mitigated by QK-norm and logit capping.
Synchronous checkpointing: pause training, write full model state to storage. Simple but adds 12-43% overhead at scale. Async checkpointing (ByteCheckpoint): snapshot to CPU/NVMe while training continues — 529x speedup over synchronous. FSDP2 sharded checkpoints: each GPU writes only its shard, parallel I/O.
MegaScale (ByteDance): automated diagnosis and repair for >90% of failures. Detects faulty GPUs, remaps workloads, and resumes from checkpoint without human intervention. At 16K+ GPU scale, manual recovery is impractical.
SFT bridges the gap between a base model (which just predicts next tokens) and an assistant (which follows instructions). Quality matters more than quantity — 10K excellent examples can outperform 1M mediocre ones.
An apprenticeship after general education. The model already “knows” the language; SFT teaches it the manners and format of a helpful assistant.
Train on 10K-100K curated examples of (instruction, ideal response) pairs. The loss function only computes on the response tokens (masking the instruction). Typically 1-3 epochs to avoid overfitting.
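Loss masking in miniature: cross-entropy is averaged only over response tokens. The token log-probabilities and mask below are illustrative stand-ins for real model output.

```python
# SFT loss with instruction tokens masked out of the objective.
def masked_nll(token_logprobs, response_mask):
    assert len(token_logprobs) == len(response_mask)
    total = sum(-lp for lp, m in zip(token_logprobs, response_mask) if m)
    return total / sum(response_mask)

# Instruction tokens (mask 0) contribute nothing; response tokens (mask 1) do.
loss = masked_nll([-0.1, -2.3, -0.5, -0.7], [0, 0, 1, 1])
```

Without the mask, the model would spend capacity learning to predict the user's instruction text, which it will never need to generate.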
Examples come from human annotators, distillation from stronger models, or synthetic generation with human filtering.
| Mode | What You Train | Key Dataset Shape |
|---|---|---|
| Supervised text | Single-turn text completions | text (and optional loss_mask) fields |
| Supervised chat | Multi-message assistants | messages[] with role/content |
| Supervised vision | Multimodal chat + tools | Chat messages with text/image content arrays |
Fireworks recommends starting with LoRA adapters for most enterprise fine-tuning, then escalating to heavier strategies only if quality targets are not met.
Loss mask lets you train only on selected spans in supervised text examples. In chat mode, loss role training keeps optimization focused on assistant turns instead of user/tool turns.
For vision and tool-using fine-tunes, Fireworks supports preserving tool-call structure so fine-tuned models remain function-calling compatible in production.
Enterprise deployments increasingly require secure fine-tuning boundaries (customer-controlled buckets and explicit retention controls) before regulated data can enter training pipelines.
Full fine-tuning: update all parameters. Best quality but expensive (~$50K per run for a 7B model on H100s).
LoRA/QLoRA: freeze base weights, train small low-rank adapter matrices. LoRA adds <1% parameters, achieves 90-95% of full fine-tuning quality at $300-$3,000 per run. QLoRA further quantizes the base model to 4-bit during training.
Fireworks defaults for LoRA SFT are deliberately conservative (rank 32, alpha 64, context 16K, ~3 epochs), then tuned from there. A major deployment advantage is multi-adapter serving: up to 100 LoRA adapters can share one base model deployment, dramatically improving utilization for long-tail custom models.
GRPO with verifiable rewards (DeepSeek R1) eliminated both the learned reward model and the value network, jumping AIME 2024 scores from 15.6% to 71.0%. The field is rapidly moving away from the complexity of traditional RLHF.
RLHF (Reinforcement Learning from Human Feedback) has 3 stages: (1) SFT the base model, (2) train a reward model on human preference comparisons, (3) optimize the policy with PPO against the reward model.
Requires 4 models in memory simultaneously (policy, reference, reward, value). Complex, expensive, and sensitive to hyperparameters. Still used by OpenAI and Anthropic but increasingly supplemented by simpler methods.
DPO (Direct Preference Optimization) reparameterizes the RLHF objective to directly optimize on preference pairs without training a separate reward model. Only needs 2 models (policy + reference). Simpler pipeline, more stable training.
Limitation: still requires pairwise comparisons (chosen/rejected responses), which are expensive to collect.
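A sketch of the DPO loss on a single preference pair. The four sequence log-probabilities (chosen/rejected under the policy and the frozen reference) are illustrative values.

```python
# DPO loss for one (chosen, rejected) pair: -log sigmoid of the beta-scaled
# margin between the policy's and reference's log-probability ratios.
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does,
# so the loss falls below log(2) (the loss at zero margin):
loss = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-7.0)
```

The reference model anchors the objective: the loss rewards the policy for shifting probability toward the chosen response *relative to the reference*, which is what keeps DPO from simply collapsing onto the preference data.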
Fireworks DPO expects explicit triplets: prompt, preferred output, and non-preferred output. This makes preference data portable across experiments and easier to audit.
When workflows depend on function calling, datasets can include tool definitions and tool-call traces so preference optimization teaches both quality and tool-use behavior.
GRPO (DeepSeek R1): no reward model, no value network. Generate multiple responses per prompt, use verifiable rewards (math correctness, code execution, format compliance) to score them, then optimize the policy on group-relative advantages: each response's reward normalized by the group's mean and standard deviation.
Result on AIME 2024: 15.6% → 71.0%. Dramatically simpler and more scalable than RLHF for domains with verifiable outcomes.
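The group-relative advantage computation in miniature; the binary rewards below stand in for a verifiable checker (e.g., 1.0 if the math answer is correct).

```python
# GRPO advantage sketch: normalize each sampled response's reward against the
# group's mean and standard deviation. No value network is needed.
import math

def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0              # guard against zero spread
    return [(r - mean) / std for r in rewards]

adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # two correct, two incorrect
```

Correct responses get positive advantage and are reinforced; incorrect ones get negative advantage and are suppressed, with the group itself serving as the baseline a value network would otherwise provide.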
Constitutional AI (Anthropic): replace human preference annotators with AI judges that evaluate responses against a written constitution. Enables RLAIF (RL from AI Feedback) at much lower cost.
KTO (Kahneman-Tversky Optimization): uses only binary labels (good/bad) instead of pairwise comparisons. Easier to collect feedback — humans just rate individual responses as thumbs up or down.
Reinforcement Fine-Tuning (RFT) bridges the gap between general-purpose alignment and domain expertise. By using verifiable domain-specific rewards, RFT produces specialist models that reason through problems rather than pattern-match from training data. OpenAI, Google, and others now offer RFT as a product for enterprise customers.
SFT teaches a medical student the textbook answers. RFT gives them a residency — repeated practice with real cases and expert feedback on their reasoning process, not just their final answer.
RFT applies reinforcement learning after SFT, using domain-specific reward signals rather than general human preferences. The model generates multiple reasoning chains for each problem, and a reward function (often automated) scores the quality of the reasoning process, not just the final answer.
Key difference from standard RLHF: rewards are verifiable and domain-grounded — math correctness, code execution results, legal citation accuracy, medical diagnosis criteria — rather than subjective human preference.
Fireworks RFT formalizes the loop into four assets: (1) an evaluator model or rubric, (2) a runtime environment for verification, (3) parameter tuning (rollouts/learning rate/steps), and (4) tracing for step-level debugging.
This turns RFT from a one-off experiment into a repeatable production workflow: train → evaluate → tune → retrain, with cost estimators and telemetry at each stage.
Domain experts provide grading rubrics rather than example answers. For a legal reasoning task, the rubric might score: correct statute identification (+1), proper precedent citation (+1), logical argument structure (+1), correct conclusion (+2). The model learns to maximize the rubric score through repeated practice.
OpenAI’s RFT product uses this approach: customers provide 50-500 expert-graded examples with multi-dimensional scoring criteria. The model trains on thousands of self-generated attempts, evaluated against the rubric.
For domains with objectively verifiable answers, rewards can be fully automated: run the generated code, check the math proof, verify the chemical formula. DeepSeek R1’s GRPO is a form of verifiable-reward RFT.
Scales massively because no human annotation is needed. The model can train on millions of self-generated attempts. This is how OpenAI o1/o3 and DeepSeek R1 achieved breakthroughs on math and coding benchmarks.
Outcome reward models (ORMs): score only the final answer. Simple but provides sparse signal — the model doesn’t know which reasoning step went wrong.
Process reward models (PRMs): score each intermediate reasoning step. Richer signal, better at teaching correct reasoning chains. OpenAI’s “Let’s Verify Step by Step” (2023) showed PRMs significantly outperform ORMs on math, reducing hallucinated reasoning.
The tradeoff: PRMs require step-level annotations, which are much more expensive to collect than outcome labels.
OpenAI RFT (2024-2025): enterprise product that fine-tunes o1/o3 models with customer-provided grading rubrics. Customers report 10-30% improvement on domain-specific tasks vs. prompting alone. Requires 50-500 graded examples.
Google Gemini RFT: similar offering through Vertex AI, focusing on code generation and structured reasoning tasks.
Fireworks RFT: integrates evaluator models, environment hooks, and secure fine-tuning options (including customer-controlled storage boundaries) for regulated use cases.
The business model: RFT is premium fine-tuning — higher margin than standard SFT because it produces genuinely differentiated models that reason better in the customer’s domain.
| Method | What It Teaches | Data Needed | Best For |
|---|---|---|---|
| SFT | Format & knowledge | 10K-100K examples | Style, format, basic skills |
| RLHF/DPO | Human preferences | Preference pairs | General helpfulness, safety |
| RFT | Domain reasoning | 50-500 graded + self-play | Specialist reasoning tasks |
Static benchmarks saturate and leak into training data. The field is shifting to dynamic evaluation: LiveBench refreshes monthly, Chatbot Arena uses live ELO rankings from blind human comparisons.
MMLU (57 subjects, multiple choice): essentially saturated at >90% for frontier models. MMLU-Pro: 10-option multiple choice with harder questions; frontier-model accuracy drops 16-33 points relative to MMLU. HellaSwag: commonsense reasoning about physical situations.
Code: HumanEval (function completion), LiveCodeBench (monthly-refreshed problems), SWE-bench (real GitHub issues — requires multi-file edits).
Math: GSM8K (grade school, largely saturated), MATH (competition-level), AIME (American Invitational Mathematics Examination, still challenging for frontier models).
Min-K% Prob: measures whether a model assigns suspiciously high probability to benchmark answers, suggesting memorization. CDD (Canonical Data Detection): checks if benchmark examples appear verbatim in training data. Perplexity-based: low perplexity on benchmark prompts suggests data leakage.
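Min-K% Prob in miniature, with illustrative token log-probabilities: average the log-probs of the k% least likely tokens, and flag texts where even the "hard" tokens are suspiciously easy for the model.

```python
# Min-K% Prob sketch: a score near 0 on the least likely tokens suggests the
# model has memorized the text (e.g., a leaked benchmark item).
def min_k_percent_prob(token_logprobs, k=0.2):
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]          # the k% least likely tokens
    return sum(lowest) / n

memorized = [-0.01, -0.02, -0.05, -0.01, -0.03]  # uniformly high probability
novel = [-0.01, -3.2, -0.05, -4.1, -0.03]        # genuinely surprising tokens
assert min_k_percent_prob(memorized) > min_k_percent_prob(novel)
```

Focusing on the tail matters: even unseen text contains many easy tokens, so averaging over all tokens would blur the memorization signal that the least likely tokens carry.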
LiveBench: monthly refresh of questions with verifiable answers. Immune to contamination because questions didn’t exist during training. Chatbot Arena (LMSYS): blind pairwise comparisons by real users, producing ELO rankings. Over 1M votes. The most trusted signal of real-world model quality.
For reasoning-capable models, evaluate at multiple runtime policies: low/high reasoning effort, budget-capped thinking, and preserved-history on/off. A model that looks strong in one policy can fail latency or quality targets in another.
This is now part of production release criteria: benchmark quality plus policy-specific inference behavior.