What it costs to train frontier AI models — and how the industry finances it.
Training costs are doubling every ~8 months. GPT-4 cost ~$100M; frontier models trained by 2027 are expected to exceed $1B. Meanwhile, cloud GPU rental prices have fallen 60-70%, creating a growing gap between buying and renting.
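The doubling claim can be sanity-checked with a quick extrapolation. A minimal sketch, using only the ~$100M GPT-4 figure and the ~8-month doubling period cited above:

```python
import math

DOUBLING_MONTHS = 8        # training-cost doubling period cited above
GPT4_COST = 100e6          # ~$100M for GPT-4
FRONTIER = 1e9             # the $1B frontier threshold

# Months for costs to grow 10x at a fixed doubling rate.
months = DOUBLING_MONTHS * math.log2(FRONTIER / GPT4_COST)
print(f"$100M -> $1B in ~{months:.0f} months ({months / 12:.1f} years)")
# ~27 months: from a 2023 baseline, well before 2027
```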
| GPU | Price | Notes |
|---|---|---|
| H100 SXM | $25-31K | The workhorse of 2024 training |
| B200 | $30-35K | 2x H100 performance |
| GB200 NVL72 | ~$3M | 72-GPU rack system |
H100 cloud pricing fell from $8/hr (early 2024) to ~$1.50/hr (late 2025) — a 60-70% drop driven by massive supply buildout and competition from CoreWeave, Lambda, and hyperscalers.
| Model | Training Cost | GPUs |
|---|---|---|
| GPT-4 | $78-100M | ~25,000 A100s |
| Llama 3 405B | ~$60M | 16,384 H100s |
| DeepSeek V3 | $5.5M (GPU only) | 2,048 H800s |
| Gemini Ultra | ~$191M | TPU v4 pods |
Training cost breakdown: chips 21-30% · server/networking 22-30% · staff 29-49% · energy 2-6%. Staff costs dominate at smaller scales; hardware dominates at frontier scale.
Chinchilla-optimal training (20 tokens/parameter) minimizes training cost but ignores inference cost. The industry now over-trains smaller models — Llama 3 70B at ~200 tokens/param — to shift cost from ongoing inference to one-time training.
Building a factory vs. running it. Chinchilla minimizes factory construction cost but doesn’t consider that a smaller, better-trained factory may produce cheaper widgets for years to come.
Chinchilla scaling laws (2022): for a fixed compute budget, the optimal allocation is 20 tokens per parameter. A 70B model should train on 1.4T tokens. This minimizes training loss per FLOP.
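A back-of-envelope version of the rule, using the standard C ≈ 6·N·D approximation for dense-transformer training compute (the 6ND factor is a common heuristic, not stated in the text above):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
# Training compute is commonly approximated as C = 6 * N * D FLOPs
# for a dense transformer with N parameters trained on D tokens.
def chinchilla_tokens(params: float) -> float:
    return 20 * params

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

N = 70e9                      # 70B-parameter model
D = chinchilla_tokens(N)      # 1.4e12 -> the 1.4T tokens quoted above
print(f"{D / 1e12:.1f}T tokens, {train_flops(N, D):.2e} FLOPs")
```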
Minimizing cost per training run ignores inference cost. A model’s lifetime inference cost can exceed its training cost many times over (GPT-4: ~$100M training vs. an estimated ~$2.3B inference over 18 months). Optimizing for training alone is locally optimal but globally wasteful.
| Model | Params | Tokens | Tokens/Param |
|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20x (optimal) |
| Llama 3 405B | 405B | 15T | 37x |
| Llama 3 8B | 8B | 15T | 1,875x |
| Qwen3-0.6B | 0.6B | 36T | 60,000x |
The industry shift: spend more on training (one-time) to make every inference call cheaper (ongoing). Over-training smaller models by 10-1,000x Chinchilla-optimal is now standard practice.
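A toy comparison makes the tradeoff concrete. All dollar figures and the query volume below are illustrative assumptions (not from this article), chosen only to show the mechanism:

```python
# Lifetime cost = one-time training + ongoing inference.
# Over-training a smaller model raises the first term and shrinks the second.
def lifetime_cost(train_cost: float, cost_per_query: float, queries: float) -> float:
    return train_cost + cost_per_query * queries

QUERIES = 50e9                                  # assumed lifetime query volume
big   = lifetime_cost(60e6, 0.002, QUERIES)     # Chinchilla-optimal large model
small = lifetime_cost(90e6, 0.0005, QUERIES)    # over-trained small model
print(f"large: ${big / 1e6:.0f}M lifetime, over-trained small: ${small / 1e6:.0f}M")
```

Under these assumptions, the model that cost $30M more to train comes out $45M cheaper over its deployed lifetime.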
OpenAI spent $7B on R&D in one year, with over 70% going to failed experiments. Less than $1B was for the final successful training run. Frontier AI research is dominated by the cost of exploration, not exploitation.
OpenAI 2024 R&D spend: ~$7B total. The vast majority (>70%) went to experimentation — architecture search, hyperparameter tuning, failed training runs, and data mixture experiments. The final production training run for a model like GPT-4 costs <$1B of that total.
This mirrors pharmaceutical R&D: the drug that ships costs a fraction of the total research budget that produced hundreds of failed candidates.
For 99% of organizations, fine-tuning or API access is the right choice. Foundation model training is only viable for the largest labs with differentiated data or architecture advantages.
| Approach | Cost Range | When It Makes Sense |
|---|---|---|
| Foundation training | $50M-$500M+ | Differentiated architecture, massive data moat |
| Full fine-tuning | ~$50K/run (7B model) | Need deep domain specialization |
| LoRA/QLoRA | $300-$3,000/run | 90-95% of full fine-tuning quality at ~1% of the cost |
| API inference | Pay per token | No training cost, fastest time to value |
Build from scratch if: you have unique data at scale, need architecture control, and can sustain $100M+/year in compute spend (OpenAI, Anthropic, Google, Meta).
Fine-tune if: you need domain-specific behavior, proprietary knowledge injection, or format/style control that prompting can’t achieve.
Use APIs if: your use case can be solved with prompting + RAG, you need rapid iteration, or you can’t justify the fixed cost of training infrastructure.
Modern managed stacks now cover the full post-training ladder: supervised text/chat/vision fine-tuning, DPO (direct preference optimization), and RFT (reinforcement fine-tuning). This collapses integration overhead compared with stitching multiple vendors together.
Fireworks-specific levers that change unit economics: one-click LoRA deployment, support for up to 100 LoRA adapters per base model on a single deployment, and secure enterprise modes for customer-controlled data boundaries.
Operationally, synthetic data generation plus built-in evaluation loops reduce labeling and experimentation costs, shifting budgets from infrastructure plumbing to model quality iteration.
Training Llama 3 405B would have cost ~$483M at AWS on-demand rates vs. ~$60M on Meta’s own cluster. The ~8x cost advantage of on-prem at scale explains why every major lab is building its own data centers.
| Lab | Compute Source | Spend / Funding |
|---|---|---|
| OpenAI | Microsoft partnership | ~$7B/yr R&D |
| Anthropic | AWS + GCP partnerships | $33.7B raised |
| Google DeepMind | Internal TPU clusters | $85B total AI CapEx (2025) |
| Meta AI | Own GPU clusters | $68B total AI CapEx (2025) |
Together AI ($3.3B valuation): managed training and inference platform. Anyscale ($1B+): Ray-based distributed training. Modal ($1.1B): serverless GPU compute with training focus. These companies provide the infrastructure layer for organizations that want to train without building their own clusters.
Cloud = buying a call option on compute (pay premium for flexibility). On-prem = owning the asset (3-8x cheaper if utilization stays above 60-70%). The right answer depends on demand predictability and time horizon.
Cloud advantages: No upfront CapEx. Access to latest hardware without procurement delays. Burst capacity for experiments. Scale down when not training. Option value: switch hardware as next-gen GPUs ship every 12-18 months.
On-prem advantages: 3-8x cheaper for sustained workloads at high utilization. Full control over hardware configuration, networking topology, and security. No egress charges for massive data movement. Predictable cost structure for financial planning.
On-prem hidden costs: Data center space and construction. Power infrastructure and cooling systems. Network engineering staff. 5-6 month GPU procurement lead times. Hardware maintenance and spare inventory. Opportunity cost of capital locked in depreciating assets.
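The 60-70% utilization threshold falls out of simple amortization arithmetic. A sketch with assumed per-GPU numbers (purchase price, annual opex, lifespan, and cloud rate are all illustrative):

```python
# Owning wins once the amortized cost per productive GPU-hour
# drops below the cloud rental rate.
def owned_cost_per_hour(capex: float, opex_per_year: float,
                        years: int, utilization: float) -> float:
    productive_hours = years * 8760 * utilization
    return (capex + opex_per_year * years) / productive_hours

CAPEX, OPEX, YEARS = 30_000, 3_000, 4   # per-GPU figures, assumed
CLOUD_RATE = 2.00                       # $/GPU-hr rental, assumed

for util in (0.3, 0.5, 0.7, 0.9):
    rate = owned_cost_per_hour(CAPEX, OPEX, YEARS, util)
    print(f"utilization {util:.0%}: ${rate:.2f}/GPU-hr owned")

# Utilization at which owning matches the cloud rate:
breakeven = (CAPEX + OPEX * YEARS) / (YEARS * 8760 * CLOUD_RATE)
print(f"break-even utilization: {breakeven:.0%}")
```

With these inputs the break-even lands near 60%, consistent with the threshold above; cheaper cloud rates push it higher, longer hardware life pushes it lower.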
GPU-backed financing enabled the AI infrastructure buildout, but the underlying collateral is depreciating rapidly: H100 rentals fell 60-70%, and hardware faces 30-40% year-one depreciation. Lenders are taking concentrated technology risk.
CoreWeave: $7.6B in GPU-backed debt, $14.6B in equipment assets. IPO’d at $40, reached $183. The poster child for GPU-as-collateral financing.
Lambda: $1.5B sale-leaseback — sell GPUs to a lessor, lease them back. Frees capital while retaining operational control.
Creative structures: 5-year GPU leases, synthetic GPU derivatives, sale-leasebacks backed by contracted cloud revenue.
H100 cloud rental fell 60-70% in 18 months. Hardware faces 30-40% year-one depreciation. Each new generation (B200, GB200) makes the previous one less economically viable. Lenders underwriting 5-year GPU loans face significant residual value risk.
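Residual-value risk can be sketched with constant-rate depreciation. The 35% annual rate and $28K purchase price below are assumptions consistent with the figures above:

```python
# Residual value under constant annual depreciation.
def residual_value(price: float, annual_rate: float, years: int) -> float:
    return price * (1 - annual_rate) ** years

H100_PRICE = 28_000    # assumed purchase price within the quoted range
for year in range(1, 6):
    value = residual_value(H100_PRICE, 0.35, year)
    print(f"year {year}: ${value:,.0f} ({value / H100_PRICE:.0%} of cost)")
```

Under these assumptions, a lender underwriting a 5-year loan against the hardware is left with collateral worth roughly a tenth of its original price.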
AI funding has reached unprecedented scale: Anthropic raised $30B at $380B valuation, OpenAI planned $40B+ rounds, and Big Tech collectively committed $405B in 2025 AI CapEx. The GPU-rich vs GPU-poor dynamic is the defining market structure.
| Company | Amount | Valuation |
|---|---|---|
| OpenAI | $40B+ | ~$300B |
| Anthropic | $30B | $380B (Feb 2026) |
| xAI | $6B | $50B |
| CoreWeave | $7.6B debt + IPO | $35B+ at peak |
2025 CapEx commitments: Google $85B, Meta $68B, Microsoft $80B, Amazon $100B+, Apple $500M (data centers). Morgan Stanley projects $3T cumulative AI infrastructure spending through 2029.
OpenAI’s $1T infrastructure plan through 2035 includes custom data centers, chip development (with Samsung/TSMC), and a global network of training clusters.
GPT-4’s lifetime inference cost (~$2.3B) is more than 20x its training cost (~$100M). As models are deployed at scale, inference becomes the dominant cost driver. The AI inference market is projected to grow from $106B (2025) to $255B by 2030.
| Year | Training | Inference |
|---|---|---|
| 2023 | ~67% | ~33% |
| 2025 | ~50% | ~50% |
| 2026 (proj.) | ~33% | ~67% |
Reasoning-capable deployment modes (interleaved thinking, preserved reasoning context, tool-augmented loops) increase runtime decode work per request. Even when training cost is unchanged, inference spend grows faster because each query executes more steps.
Cost controls increasingly move to runtime policy: reasoning effort levels, thinking-token budgets, and history retention policies. In practice, product pricing and scheduler policy now matter as much as model architecture for total lifecycle margin.
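The runtime-policy point reduces to token arithmetic: thinking tokens are decoded (and paid for) like output tokens, so the budget directly multiplies per-request cost. The price and token counts below are illustrative assumptions:

```python
# Per-request serving cost as a function of the thinking-token budget.
def request_cost(output_tokens: int, thinking_tokens: int,
                 price_per_1m: float) -> float:
    return (output_tokens + thinking_tokens) * price_per_1m / 1e6

PRICE = 10.0                                  # $/1M decoded tokens, assumed
plain     = request_cost(500, 0, PRICE)       # direct answer
reasoning = request_cost(500, 4_000, PRICE)   # same answer + thinking budget
print(f"plain: ${plain:.4f}/req, reasoning: ${reasoning:.4f}/req "
      f"({reasoning / plain:.0f}x)")
```

Same model, same training cost, 9x the unit cost per query in this example — which is why thinking budgets and history retention are now first-order margin levers.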
Total AI industry spend in 2025: ~$527B. Total AI revenue: ~$51B. That’s a 10:1 spend-to-revenue ratio, the largest infrastructure-to-revenue gap since the early days of cloud computing. The bet: inference revenue at scale will eventually justify the capital deployed.