How AI inference is priced, built, financed, and scaled — the economics behind every API call.
The technical pipeline follows the token from request to response. This page follows the dollar — from the cost of a single GPU-hour through pricing, business models, and capital markets.
Cost per million tokens = Total hourly GPU cost ÷ tokens served per hour. Throughput optimization is literally margin expansion.
A restaurant's cost per meal — ingredients, rent, labor, utilities. Two restaurants with identical kitchens can have wildly different costs per plate based on how many covers they turn.
| Component | Crusoe (owns infra) | CoreWeave (leases some) |
|---|---|---|
| GPU CapEx amortized (3yr) | ~$0.95/hr | ~$1.10/hr |
| Power | ~$0.03/hr | ~$0.07/hr |
| Data center (amortized) | ~$0.15/hr | ~$0.25/hr |
| Networking | ~$0.10/hr | ~$0.10/hr |
| Operations/platform | ~$0.10/hr | ~$0.12/hr |
| Total cost per H100-hr | ~$1.33/hr | ~$1.64/hr |
| Selling price | $2.20/hr | $2.20/hr |
| Gross Margin | ~40% | ~25% |
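The cost stack above can be checked in a few lines. The throughput figure (an assumed ~10,000 tokens/sec aggregate for a small model under continuous batching) is illustrative, not from the table:

```python
# Unit economics for one H100-hour, using the Crusoe-style cost stack above.
cost_stack = {
    "gpu_capex_amortized": 0.95,  # $/hr, 3-year straight line
    "power": 0.03,
    "datacenter": 0.15,
    "networking": 0.10,
    "operations": 0.10,
}
total_cost = sum(cost_stack.values())          # ~$1.33/hr
selling_price = 2.20
gross_margin = (selling_price - total_cost) / selling_price

# Cost per million tokens = hourly GPU cost / tokens served per hour.
tokens_per_second = 10_000                     # assumed aggregate throughput
tokens_per_hour = tokens_per_second * 3600
cost_per_m_tokens = total_cost / tokens_per_hour * 1e6

print(f"total cost:   ${total_cost:.2f}/hr")
print(f"gross margin: {gross_margin:.0%}")
print(f"cost/M tok:   ${cost_per_m_tokens:.3f}")
```

Note that at the same $2.20/hr price, the higher-cost stack earns a structurally lower margin; throughput is the other lever.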
GPU purchase: H100 SXM ~$25-30K, B200 ~$30-40K. Purchased GPUs amortize over 3-5 years — cheaper per hour if utilization is high, but they carry depreciation risk from technology obsolescence. Rented GPUs are OpEx at ~$2.00-3.00/hr mid-market (down from $7-8/hr peak).
GPU failure rate: ~2-5% annually, requiring spare buffer inventory. At 10,000 GPUs, expect 200-500 failures per year.
Power cost varies wildly: $0.03-0.05/kWh (Crusoe’s stranded energy) vs $0.08-0.12/kWh (grid in Northern Virginia). A single H100 draws ~700W under load. At $0.10/kWh = $0.07/hr per GPU. At 10,000 GPUs = $6.1M/year in electricity.
PUE (Power Usage Effectiveness): 1.1 means 10% cooling overhead; 1.4 means 40%. Liquid cooling pushes closer to 1.1; air cooling sits at 1.3-1.5. Every 0.1 improvement in PUE across a 100MW facility saves on the order of $5M/year at grid power rates.
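A sketch of the power math above; the wattage, rates, and fleet size are the figures from the text:

```python
def power_cost_per_gpu_hour(watts, dollars_per_kwh, pue=1.0):
    """Electricity cost of one GPU for one hour, including cooling overhead (PUE)."""
    return watts / 1000 * dollars_per_kwh * pue

per_hour = power_cost_per_gpu_hour(700, 0.10)    # H100 under load at grid rates
fleet_annual = per_hour * 10_000 * 8760          # 10,000 GPUs running 24/7

# PUE sensitivity: same fleet, liquid (1.1) vs air (1.4) cooling.
liquid = power_cost_per_gpu_hour(700, 0.10, pue=1.1) * 10_000 * 8760
air = power_cost_per_gpu_hour(700, 0.10, pue=1.4) * 10_000 * 8760

print(f"per GPU-hour: ${per_hour:.3f}")
print(f"fleet/year:   ${fleet_annual / 1e6:.1f}M")
print(f"air vs liquid cooling penalty: ${(air - liquid) / 1e6:.1f}M/year")
```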
InfiniBand for multi-GPU inference adds 15-25% to total cluster cost. For a 1,000 H100 cluster: $5-15M in NICs, switches, and cabling. NVIDIA/Mellanox near-monopoly limits price negotiation.
Storage for model weights, KV cache spill, logging: ~5-10% of compute cost.
→ How InfiniBand works in the pipeline
→ Training hardware costs comparison
Operational costs include SRE/DevOps staff, monitoring infrastructure, on-call rotations, and platform orchestration (Kubernetes, Slurm, auto-node-replacement).
GPU failures at 2-5% annually mean a 10,000-GPU fleet needs constant triage — detecting degraded GPUs, migrating workloads, RMA processing. This is where OpEx quietly compounds.
Throughput optimization IS margin expansion. If custom CUDA kernels serve 2x the tokens/sec on the same GPU, cost per token halves.
Two factories with identical machines. One runs at 30% capacity with long changeover times; the other runs at 85% with quick changeovers. Same capital cost, dramatically different unit economics.
Serving an 8B model is ~10x cheaper per token than 405B. The 405B model needs 8-16 GPUs (vs 1), performs 50x more matrix operations per token, and the larger KV cache per token reduces batch slots from 64+ to 8-16.
| Factor | 8B | 405B | Penalty |
|---|---|---|---|
| Weight data per token | ~16 GB | ~810 GB | ~50x |
| GPUs required | 1 | 8-16 | 8-16x cost |
| Communication overhead | Zero | 126 all-reduce ops | Pure penalty |
| Batch size | 64+ | 8-16 | 4-8x less throughput |
| Cost / M tokens | $0.03-0.05 | $0.50-1.00 | 10-20x |
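Plugging the table into the cost-per-token formula shows where the 10-20x gap comes from. The per-model throughput numbers below are assumptions chosen to be consistent with the table, not measurements:

```python
def cost_per_million_tokens(gpus, cost_per_gpu_hour, tokens_per_second):
    """Provider cost to serve one million tokens on a multi-GPU deployment."""
    tokens_per_hour = tokens_per_second * 3600
    return gpus * cost_per_gpu_hour / tokens_per_hour * 1e6

small = cost_per_million_tokens(1, 1.33, 10_000)   # 8B: one GPU, large batch
large = cost_per_million_tokens(8, 1.33, 4_000)    # 405B: 8 GPUs, small batch
print(f"8B:   ${small:.3f}/M tokens")
print(f"405B: ${large:.2f}/M tokens  ({large / small:.0f}x)")
```

The gap compounds: more GPUs in the numerator and fewer tokens per second in the denominator.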
The same model on the same GPU can have 3-5x throughput variance between naive and fully optimized serving. This includes custom attention kernels (FlashAttention, PagedAttention), quantization (FP16 → INT4), continuous batching, speculative decoding, and prefix caching.
This is why inference platform companies like Fireworks exist — their serving infrastructure (FireAttention, MemoryAlloy) extracts dramatically more tokens/sec from the same hardware.
→ Performance Optimization Stack details
A GPU serving one request at a time wastes most of its compute capacity. Continuous batching enables 80-90% utilization by adding new requests to an in-flight batch as slots free up.
Request characteristics matter: long context uses more KV cache memory, reducing batch slots. Output tokens cost more than input tokens because decode is sequential while prefill is parallel.
→ How continuous batching works
Reasoning models add internal reasoning streams on top of visible output. Effective capacity planning must split visible tokens from reasoning tokens, then cap runtime via reasoning effort or thinking budget controls.
Interleaved thinking increases scheduler contention because reasoning and tool calls share the same decode loop. Preserved thinking can reduce repeated work across turns, but raises carried-context cost if history policies are too permissive.
→ Decode behavior for reasoning models
Hardware generation creates step-function improvements. B200 with FP4 serves the same model at dramatically higher throughput than H100 at FP16. Blackwell delivers 3-5x performance per dollar vs Hopper.
This drives the technology obsolescence risk in GPU ownership — an H100 purchased today may be economically obsolete before fully depreciated.
Frontier inference cost is declining ~10x annually. Reserved contracts are economically similar to fixed-rate swaps on GPU compute prices.
Airline pricing — first class, economy, standby, and corporate contracts all sell the same seat-mile at wildly different prices based on flexibility, commitment, and timing.
| Segment | Volume | Pricing Model | Margin |
|---|---|---|---|
| Hobbyist / prototyping | <1M tok/day | Serverless per-token, free tier | Low/negative |
| Growth startup | 1-100M tok/day | Serverless → on-demand | Medium, expanding |
| Enterprise | 100M+ tok/day | Reserved capacity, custom | High |
| Batch / offline | Large, flexible | Batch pricing (50% off) | Medium (fills idle) |
GPT-3.5 equivalent pricing: $20/M tokens (late 2022) → $0.40/M tokens (2025). A 50x decline in ~2.5 years. Reserved contracts lock in today’s price — if costs keep falling, the provider profits on the spread while customers overpay for certainty.
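The two data points imply an annual price multiplier, which is what a reserved contract effectively locks against. A rough sketch (the 2.5-year window and endpoint prices are from the text; the interpolation is an assumption):

```python
p_start, p_end, years = 20.0, 0.40, 2.5               # $/M tokens, late 2022 -> 2025
annual_multiplier = (p_end / p_start) ** (1 / years)  # ~0.21, i.e. ~5x cheaper per year

# A 1-year contract locked at the starting price vs the (declining) spot price:
contract = p_start
spot_after_1yr = p_start * annual_multiplier
print(f"annual price multiplier: {annual_multiplier:.2f}")
print(f"after 1 year, contract pays ${contract:.2f}/M vs spot ${spot_after_1yr:.2f}/M")
```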
Standard for API providers. Input/output split is critical: output tokens cost 2-4x more to serve because decode is sequential. Without the split, adverse selection occurs — output-heavy workloads subsidized by input-heavy ones.
Cached input pricing (~50% discount) incentivizes consistent prefixes, improving server-side prefix cache hit rates. This is pricing that shapes behavior to reduce costs.
→ Rate limits as pricing levers
For reasoning models, many teams now separate visible output from reasoning streams (for example reasoning_content) and apply policy caps with effort, budget, and history controls.
Fine-tuned offerings add another pricing layer: base model usage + adapter premium. Fireworks-style LoRA deployment economics (including multi-adapter serving and one-click adapter deployment) allow providers to segment by workflow quality, not just raw token volume.
→ Fine-tuning methods and deployment implications
Revenue predictability for both sides. Reserved contracts are economically similar to fixed-rate swaps: the provider receives fixed payments and effectively pays floating (the market cost of delivering compute). In a deflationary environment, existing contracts become more valuable.
The real risk isn’t the contract price — it’s what the contract signals about capacity planning. Contracted capacity is hedged; uncontracted capacity is an outright bet on future pricing.
Like standby airline tickets — fills idle capacity. Customer gives up latency guarantees in exchange for significant discount. Provider gains utilization in off-peak periods.
Batch workloads smooth demand curves: if real-time peaks at 2pm, batch jobs absorb 2am-6am idle capacity. This improves overall fleet utilization from 40-50% to 80-90%.
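A toy 24-hour demand profile shows how batch backlog fills the valleys. The profile shape and backlog size are invented for illustration:

```python
# Hourly real-time utilization (fraction of fleet busy) over a day, assumed shape.
realtime = [0.2] * 6 + [0.5] * 4 + [0.8] * 6 + [0.5] * 4 + [0.3] * 4

cap = 0.9                 # leave headroom for bursts
batch_backlog = 9.0       # queued batch work, in fleet-hours
filled, remaining = [], batch_backlog
for u in realtime:
    add = min(cap - u, remaining)   # greedily pour batch work into idle capacity
    filled.append(u + add)
    remaining -= add

base = sum(realtime) / 24
blended = sum(filled) / 24
print(f"real-time only: {base:.0%} utilization")
print(f"with batch:     {blended:.0%} utilization")
```

Real schedulers also handle preemption and latency classes, but the utilization math is this simple at its core.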
Land and expand: Price serverless tier aggressively (even below cost) to acquire developers, monetize at scale when usage grows.
Value-based: Voice agent inference priced per-minute, not per-token. Aligns with customer’s value perception — they think in “minutes of agent time,” not tokens.
Competitive moat: If you have 2x throughput advantage (FireAttention), price 30% below competitors while maintaining better margins. Your optimization is their impossibility.
Managed inference sells outcomes at 55-70% gross margin via statistical multiplexing. GPU rental sells infrastructure at 25-40% margin with simpler operations but massive CapEx.
Uber vs Hertz. Uber sells rides (outcomes) and pools cars across thousands of passengers. Hertz rents you the car — you worry about driving. Both make money on vehicles, but the economics are completely different.
| Dimension | Managed Inference | GPU Rental |
|---|---|---|
| What you sell | Tokens, completions, minutes | Raw GPU-hours |
| Customer thinks about | Outcomes | Infrastructure |
| Gross margin | 55-70% | 25-40% |
| Key advantage | Statistical multiplexing | Simpler operations |
| CapEx intensity | Lower (can rent GPUs) | Very high |
| R&D cost | High (serving stack) | Moderate (platform) |
| Risk | Correlated demand spikes | Utilization & pricing deflation |
Statistical multiplexing — Customer A peaks at 2pm, Customer B at 6pm, Customer C runs batch overnight. Pooling across thousands of customers enables 80-90% GPU utilization, far higher than any single customer achieves.
Like insurance pooling — the law of large numbers applies to token traffic. The risk: correlated demand spikes. When a new model drops or a viral AI demo happens, everyone hits the API simultaneously. This is the “catastrophic event” requiring burst capacity or graceful degradation.
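A quick simulation of the pooling effect: 200 hypothetical customers with peaks scattered across the day. The profile shapes and sizes are made up; the point is that the pooled peak sits far below the sum of individual peaks:

```python
import random

random.seed(42)
HOURS = 24
customers = []
for _ in range(200):
    peak_hour = random.randrange(HOURS)
    peak_gpus = random.uniform(5, 50)
    # Demand decays linearly away from each customer's peak hour (circular day).
    profile = []
    for h in range(HOURS):
        dist = min(abs(h - peak_hour), HOURS - abs(h - peak_hour))
        profile.append(peak_gpus * max(0.1, 1 - dist / 6))
    customers.append(profile)

pooled = [sum(c[h] for c in customers) for h in range(HOURS)]
pooled_peak = max(pooled)
sum_of_peaks = sum(max(c) for c in customers)

# Provisioning for the pooled peak needs far fewer GPUs than per-customer
# peaks, and the pooled fleet runs much closer to flat-out.
print(f"capacity needed, pooled:    {pooled_peak:,.0f} GPUs")
print(f"capacity needed, dedicated: {sum_of_peaks:,.0f} GPUs")
print(f"pooled utilization:         {sum(pooled) / HOURS / pooled_peak:.0%}")
```

Correlated demand (everyone peaking at once) is exactly the case this simulation excludes, which is why it is the business model's tail risk.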
→ How continuous batching enables this
Input vs output split: Output tokens cost 2-4x more to serve. Without the split, adverse selection occurs — output-heavy workloads are subsidized.
Cached input pricing: ~50% discount incentivizes consistent prefixes, improving prefix cache hit rates. Pricing that shapes behavior.
Model-specific pricing: Based on cost to serve, competitive pricing, demand elasticity, and strategic value. Popular models subsidize long-tail catalog with sparse traffic.
1. Model catalog: Which models to add and when — each has hosting cost whether used or not. Warm GPUs for a model nobody calls is pure waste.
2. Rate limits & SLA tiers: Maps to burst capacity reserved per customer. Higher tier = more dedicated headroom = higher price.
3. Model deprecation: Old models eat GPU memory, but enterprise customers depend on them. Migration timelines are diplomatic minefields.
4. Optimization pass-through: When your team ships a 2x throughput improvement, do you pass savings to customers (growth) or keep as margin (profitability)?
When Crusoe launches Managed Inference alongside GPU rental, they create internal tension. If MemoryAlloy delivers 9.9x better TTFT and 81% cost reduction, why would any inference customer rent raw GPUs?
The resolution: GPU rental increasingly serves training customers and custom workloads. Managed inference captures inference demand at higher margin. Total revenue per GPU potentially goes up — but the PM must model the revenue migration carefully.
Buying a GPU = being long a depreciating asset with uncertain future value. Renting = buying a monthly call option on compute capacity. The optimal strategy is a barbell: own base-load, rent flexibility.
Electric utilities: own baseload power plants (nuclear, hydro — predictable, cheap per kWh) and buy peaking capacity on the spot market (gas turbines — flexible, expensive per kWh).
| Scenario | H100 Residual (3yr) | Value | Implication |
|---|---|---|---|
| Bull case | 40% | ~$11K | Ownership strongly favored |
| Base case | 20% | ~$5.5K | Ownership favored at high utilization |
| Bear case | 5% | ~$1.4K | Next-gen makes it essentially worthless |
Buy side TCO per GPU-hour: Purchase price amortized over useful life + cost of capital + power/cooling + maintenance (2-5% annual failure rate) − residual/salvage value.
Depreciation tension: H100 straight-line over 3 years = ~$0.95/hr. Over 5 years = ~$0.57/hr. But GPU tech cycles are accelerating: H100 (2022) → H200 (2024) → B200 (2025). If Blackwell delivers 3-5x performance per dollar, H100 depreciating over 5 years is economically obsolete before fully depreciated.
Rent side: ~$2.00-3.00/hr for H100. No upfront CapEx. Flexibility to scale. Obsolescence risk sits with the lessor — but you’re paying their margin.
If blended WACC for GPU purchases is 12%, a $28K H100 carries $3,360/year in capital charges alone — adding ~$0.38/hr and pushing the amortized hardware cost from ~$1.10/hr to ~$1.48/hr.
Breakeven utilization moves from 44% to ~59%. If funded with pure equity at 25%, breakeven jumps to ~76%. Interest rate environment matters: 300-400 bps difference on $1B GPU purchase = $30-40M/year in additional interest expense.
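A simplified breakeven sketch that reproduces the figures above under assumed inputs (a $28K H100, 3-year amortization, a $2.50/hr rental rate, variable costs ignored): owning beats renting once utilized hours cover the fixed annual cost.

```python
HOURS_PER_YEAR = 8760

def breakeven_utilization(purchase_price, amort_years, capital_rate, rent_per_hour):
    """Utilization at which owning a GPU costs the same per used hour as renting.

    Fixed annual cost = straight-line amortization plus a capital charge on the
    purchase price; power and ops are left out for simplicity.
    """
    fixed_annual = purchase_price / amort_years + purchase_price * capital_rate
    return fixed_annual / (HOURS_PER_YEAR * rent_per_hour)

no_capital_charge = breakeven_utilization(28_000, 3, 0.00, 2.50)  # ~43%
blended_wacc_12 = breakeven_utilization(28_000, 3, 0.12, 2.50)    # ~58%
pure_equity_25 = breakeven_utilization(28_000, 3, 0.25, 2.50)     # ~75%
print(f"{no_capital_charge:.0%} -> {blended_wacc_12:.0%} -> {pure_equity_25:.0%}")
```

The capital charge does nothing to the hardware; it only raises the bar the hardware must clear.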
Renting = buying a monthly call option on compute. You pay a premium but maintain optionality to switch hardware, scale down, or pivot. Option value increases when volatility is high (AI hardware changing every 12-18 months), time horizon is uncertain, and interest rates are high.
Buying is attractive when: Demand is highly predictable (Crusoe’s 15-year Abilene lease), structural cost advantage exists (cheap power extends economic life), and the asset can be redeployed across use cases.
→ GPU Memory Hierarchy details
The optimal strategy: heavy ownership of base-load capacity funded by low-cost infrastructure debt secured against long-term contracts, combined with rental/spot capacity for flexibility.
Like utilities: own baseload plants, buy peaking capacity on the spot market. The owned portion provides cost advantage; the rented portion provides optionality. The ratio depends on demand predictability and cost of capital.
Crusoe's ~$50/kW/month energy advantage at a 100MW facility = $60M/year in structural cost savings that flow directly to margin or competitive pricing.
Real estate for aluminum smelters — you don’t price by square footage because the smelter’s value is entirely determined by access to cheap electricity. Same with AI data centers.
| Facility Type | $/kW/month |
|---|---|
| Wholesale colocation (legacy) | $80-120 |
| Retail colocation | $120-180 |
| AI-optimized facility | $150-250+ |
| Hyperscaler self-build | $50-80 effective |
Traditional operator at $0.10/kWh: 1 kW continuous for a month = $72 just in electricity. Out of $150/kW/month price, nearly half is power. Crusoe at $0.03/kWh: same 1 kW = $22/month.
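The per-kW math above, generalized (730 is the approximate hours in a month; the $150/kW/month colo price is from the table):

```python
HOURS_PER_MONTH = 730

def monthly_power_cost_per_kw(dollars_per_kwh, pue=1.0):
    """Electricity cost of 1 kW of continuous IT load for one month."""
    return dollars_per_kwh * HOURS_PER_MONTH * pue

grid = monthly_power_cost_per_kw(0.10)     # ~$73/kW/month
crusoe = monthly_power_cost_per_kw(0.03)   # ~$22/kW/month
colo_price = 150.0
print(f"grid power:  ${grid:.0f}/kW/month ({grid / colo_price:.0%} of the colo price)")
print(f"cheap power: ${crusoe:.0f}/kW/month")
```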
Legacy server rack: 2-4 kW. AI rack with 8 H100s: 40-70 kW. GB200 NVL72 rack: 120-140 kW. That’s 30-70x more power but roughly the same physical footprint. Pricing by square footage breaks completely.
Power delivery infrastructure alone costs millions: transformers ($1-5M each), switchgear, UPS, backup generators, distribution. Redundancy requirements (N+1 or 2N) mean 100 kW provisioned → 200 kW built.
Every watt consumed becomes a watt of heat to remove. At 140 kW/rack, air cooling physically cannot keep up — liquid cooling is required. Cost scales linearly with kW.
PUE captures this overhead. PUE of 1.1 (liquid) means 10% cooling overhead. PUE of 1.4 (air) means 40%. The difference at 100MW: an extra 30MW consumed just for cooling, or ~$15M/year at grid rates.
The $/kW/month price bundles: power delivery infrastructure (transformers, switchgear, UPS, generators, distribution), cooling capacity (linearly proportional to power), redundancy (N+1 or 2N), and physical space (essentially free at this point — tiny fraction of total cost).
This is why the metric is $/kW/month: it captures the actual scarce resource (power capacity) rather than the abundant one (floor space).
Equity costs 3-4x more than debt because equity holders bear residual risk with no contractual protection. The tax shield on debt interest makes the gap even wider.
A building with floors. Debt holders hold the ground floor — closest to the exit, first out when the building has to be evacuated. Equity holders live in the penthouse — great view, but they sway with every tremor. Higher floor = higher risk = higher rent.
| Dimension | Debt | Equity |
|---|---|---|
| Priority in liquidation | First (senior) | Last (residual) |
| Cash flows | Contractual (fixed interest) | Residual (dividends optional) |
| Upside | Capped at interest rate | Uncapped |
| Downside | Protected by covenants | Can lose everything |
| Tax treatment | Interest is deductible | Returns are not |
| Typical cost | 7-10% | 20-30%+ |
Interest payments are tax-deductible. At 21% corporate tax rate, 8% interest → ~6.3% after-tax cost. Equity returns have no such benefit.
On $5B of debt at 8%, the tax shield is worth $84M/year — pure value creation from choosing debt over equity for the same investment.
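Both numbers fall out of the same two-line calculation:

```python
def after_tax_cost_of_debt(interest_rate, tax_rate):
    """Effective cost of debt once interest deductibility is counted."""
    return interest_rate * (1 - tax_rate)

def annual_tax_shield(principal, interest_rate, tax_rate):
    """Taxes avoided each year because interest is deductible."""
    return principal * interest_rate * tax_rate

cost = after_tax_cost_of_debt(0.08, 0.21)          # ~6.3%
shield = annual_tax_shield(5e9, 0.08, 0.21)        # $84M/year
print(f"after-tax cost of debt: {cost:.2%}")
print(f"tax shield on $5B:      ${shield / 1e6:.0f}M/year")
```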
Debt holder’s max upside is the interest rate (8%). They price for downside protection via covenants, collateral, and priority. Equity investors face uncapped upside and uncapped downside — they need much higher expected returns to justify the risk.
Information asymmetry compounds this: debt holders protect themselves with covenants, while equity holders have board seats but can’t contractually force profitability. More trust required = riskier = more expensive.
If debt is cheaper, why not fund everything with debt? Three limits:
1. Financial distress risk: Too much leverage makes a temporary downturn existential. Missing one interest payment can trigger default cascades.
2. Debt capacity limits: Lenders won’t fund beyond cash flow coverage. The DSCR sets a hard ceiling.
3. Rising marginal cost: First billion at 8%, fifth billion at 12%. At some point, additional debt costs more than equity.
A 40% commitment discount looks expensive in isolation, but the contracted revenue it creates can unlock debt capacity whose NPV far exceeds the discount given. Pricing decisions directly affect capital structure.
Getting a mortgage. The bank doesn’t just look at the house (collateral) — they look at your salary and employment contract (contracted revenue). A tenured professor with $100K salary gets a bigger mortgage at a lower rate than a freelancer earning $150K with no contract.
Lenders size loans using DSCR — want cash flow at 1.3-2.0x annual debt service.
Without contracted revenue: Lender underwrites to conservative $300M → $150M FCF → supports ~$100M/year debt service → ~$1-1.5B total debt capacity.
With 15-year Oracle contract at $600M/year: Lender underwrites to $600M → $400M FCF → supports ~$267M/year debt service → ~$3-4B total debt capacity. Same facility, dramatically different borrowing.
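The underwriting math above is consistent with interest-only sizing at an assumed ~8% rate — a simplification, but it reproduces both capacity figures:

```python
def debt_capacity(free_cash_flow, dscr, interest_rate):
    """Interest-only debt capacity: FCF must cover debt service at the DSCR."""
    max_debt_service = free_cash_flow / dscr
    return max_debt_service / interest_rate

uncontracted = debt_capacity(150e6, 1.5, 0.08)   # ~$1.25B
contracted = debt_capacity(400e6, 1.5, 0.08)     # ~$3.3B
print(f"without contract:     ${uncontracted / 1e9:.2f}B")
print(f"with Oracle contract: ${contracted / 1e9:.2f}B ({contracted / uncontracted:.1f}x)")
```

Amortizing structures lower the capacity somewhat, but the contracted-vs-uncontracted multiple is the same.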
Contracted revenue with investment-grade counterparty compresses spread: SOFR + 400bps → SOFR + 250bps. On $5B = $75M/year in interest savings.
Looser covenants: fewer restrictions on additional borrowing, less stringent maintenance ratios, more operational flexibility. Longer tenor: 15-year contracted revenue supports 10-12 year debt vs 3-5 years for uncontracted.
Lenders structure assignment of contracts as security interest. If Crusoe defaults, JPMorgan could step into Crusoe’s position and receive Oracle’s lease payments directly.
Like mortgage-backed securities — the payment streams themselves are the security. Not the physical assets, but the right to receive contracted cash flows.
Every 1-year or 3-year GPU reservation contract signed makes the debt package more attractive to lenders → cheaper debt → lower WACC → more competitive pricing → more customers → more contracts → even more debt capacity.
This is why commitment discounts aren’t just about revenue: they’re a capital structure optimization. The PM must model the full-cycle NPV, not just the direct pricing impact.
Debt requires predictability. Equity tolerates uncertainty. Crusoe runs two capital structures in one company: project-finance-funded infrastructure + venture-funded software.
A person’s financial life. College student = 100% “equity” (parental support, scholarships). First job = some debt (car loan). Established career = mortgage, credit lines. Retired = optimized portfolio. More predictable income unlocks more leverage.
Series A: No revenue, no assets, no track record. Might not exist in 18 months. Cost of equity: 50-100%+ implied (VCs need 100x potential). Debt is unavailable. Mix: 95-100% equity. Exception: venture debt (20-30% of last equity round) with warrants.
Series B: $5-20M ARR, real product, paying customers. Cost of equity: 30-50%. Cost of debt: 12-15% (venture debt with warrants). Mix: 80-90% equity. Equipment financing becomes possible (80% LTV on GPU purchases).
Series C/D: $50-200M+ ARR, proven unit economics, enterprise contracts. Cost of equity: 15-25%. Cost of debt: 8-12% (term loans, asset-backed). Mix: 60-80% equity. CoreWeave pioneered billions in GPU-backed debt at this stage.
Pre-IPO: $200M-1B+ revenue, clear profitability path. Cost of equity: 12-20%. Cost of debt: 6-9%. Mix: 40-60% equity. Convertible notes popular: lower interest (2-5%) with embedded call option. Crusoe is here now (Series E, $10B+).
Cost of equity drops dramatically: 10-15% (liquidity, transparency, diversifiability). Cost of debt: 4-7% (investment-grade bonds, commercial paper). Gap narrows but never closes — debt holders are always paid first.
Active capital management: share buybacks, debt-funded buybacks, dividend policy, credit rating management. The CFO becomes a portfolio manager optimizing the capital structure continuously.
Crusoe runs two capital structures simultaneously:
Project-finance infrastructure: Abilene’s $9.6B debt + $5B equity. Low cost of capital, asset-heavy, contracted cash flows. Like a utility or pipeline company.
Venture-funded software: Cloud platform, managed inference. High cost of capital, asset-light, uncertain returns. Like a typical tech startup.
The PM must understand which investments belong to which bucket. Managed inference feature ($20M) → equity-funded, needs venture-scale returns. Data center with signed contract → leverage cheap debt, lower hurdle rate.