Cost to train the full Llama-3 “herd” today
We ground estimates in Meta’s public disclosures (two 24k-H100 clusters and a 54-day 405B run) and primary cloud/H100 specs, then layer in electricity, data-preparation, and overhead costs.
- ~38.0M H100-GPU-hours (herd total)
- $149M–$570M GPU rental (training only). Low: AWS Capacity Blocks ≈ $3.933/H100-h [AWS]; Mid: $10/H100-h; High: $15/H100-h. Range applied to ~38.0M H100-h (excludes experiments and data prep).
- $3.0M–$6.0M electricity (IT + facility overhead)
Assumptions: dense models, H100 BF16 throughput; we use Meta’s observed 405B runtime as an anchor and scale 8B/70B by the same utilization factor. Personnel costs are excluded, per the brief.
How this model is built (one screen, no hand-waving)
- Anchor reality: Meta trained Llama-3 on two custom 24k-H100 clusters; Llama-3.1-405B had a ~54-day run
[Meta-Eng], [NVIDIA-blog], [DCD].
- Compute accounting: instead of the purely theoretical 6·P·T FLOPs estimate
[Hoffmann et al.] with an assumed MFU, we calibrate to the observed GPU-hours of the 405B run, then scale to 70B/8B (see the sketch after the anchor figures below).
- Prices: Primary cloud pricing (AWS Capacity Blocks effective H100-hour)
[AWS], and H100 hardware “street” prices for capex context
[Reuters].
- Power: Measured per-node draw for H100 HGX during LLM training
[Latif et al.], with industry PUE baseline
[Uptime].
- Data prep: Use public pipelines (FineWeb, GopherCite) as proxies to bound tokenization/filtering costs at Llama-scale
[Le Monde/FineWeb], [GopherCite].
- Anchor: two 24,576-H100 clusters. Meta used two such clusters for Llama-3 [Meta-Eng].
- Runtime: ≈54 days for the 405B training window, per an operational report on the run [DCD].
- Tokens: Llama-3.1 trained on 15.6T tokens; long-context add-on ~0.8T at 128k [Llama-3.1 TR].
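To make the calibration concrete, here is a minimal Python sketch of the GPU-hour accounting. The cluster size, runtime, and token count come from the sources above; treating compute as 6·P·T and reusing the 405B-implied effective throughput for the 70B/8B models are this article's modelling assumptions, and the variable names are ours.

```python
# GPU-hour calibration sketch (assumptions: 6*P*T compute, constant effective throughput).
GPUS = 24_576            # one Meta cluster [Meta-Eng]
DAYS_405B = 54           # reported 405B training window [DCD]
TOKENS = 15.6e12         # Llama-3.1 pretraining tokens [Llama-3.1 TR]

hours_405b = GPUS * DAYS_405B * 24              # ~31.85M H100-hours
flops_405b = 6 * 405e9 * TOKENS                 # classic 6*P*T approximation
flops_per_gpu_hour = flops_405b / hours_405b    # implied effective throughput

def est_gpu_hours(params, tokens=TOKENS):
    """Scale to another dense model at the same effective throughput."""
    return 6 * params * tokens / flops_per_gpu_hour

herd_hours = hours_405b + est_gpu_hours(70e9) + est_gpu_hours(8e9)
print(f"405B: {hours_405b / 1e6:.2f}M H100-h; herd: {herd_hours / 1e6:.1f}M H100-h")
# -> 405B: 31.85M H100-h; herd: 38.0M H100-h
```

Under this throughput assumption, the 70B and 8B models add only about 6M H100-hours on top of the 405B anchor, which is where the ~38.0M herd total comes from.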
1) Cost of a “big enough” H100 cluster (purchase vs. cloud)
Purchase (capex context)
H100 “street” price: recent reporting pegs units around $20k–$25k each
[Reuters].
A single 24,576-GPU cluster implies $492M–$614M for GPUs alone.
Full racks (HGX servers, network, storage, power) commonly bring total system cost well above pure GPU cost; a 2× multiplier is a reasonable planning assumption (explicit OEM quotes vary).
Cloud (opex)
AWS Capacity Blocks: effective rate ≈ $31.464 per p5.48xlarge (8×H100)-hour ⇒ $3.933 per H100-hour
[AWS].
This reserved-capacity mechanism undercuts historical on-demand tracker quotes; it’s a primary source and our “low” scenario.
Azure’s ND H100 v5 instances use comparable 8×H100 nodes (specs)
[Azure].
- Rental anchor: ≈ $3.933/H100-h, the AWS Capacity Blocks effective rate [AWS].
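A quick sanity check on the purchase-vs-rent figures above; the 2× buildout multiplier is the planning assumption stated earlier (not an OEM quote), and the variable names are illustrative.

```python
GPUS = 24_576
GPU_PRICE = (20_000, 25_000)     # H100 "street" price per unit [Reuters]
P5_HOURLY = 31.464               # AWS Capacity Blocks, p5.48xlarge (8x H100) [AWS]

capex_gpus_only = [GPUS * p for p in GPU_PRICE]         # GPUs alone
capex_full_system = [2 * c for c in capex_gpus_only]    # ~2x multiplier (assumption)
per_h100_hour = P5_HOURLY / 8                           # effective hourly rate

print(f"GPU-only capex: ${capex_gpus_only[0] / 1e6:.0f}M-${capex_gpus_only[1] / 1e6:.0f}M")
print(f"Fully built out (2x assumption): ${capex_full_system[0] / 1e9:.1f}B-${capex_full_system[1] / 1e9:.1f}B")
print(f"Effective cloud rate: ${per_h100_hour:.3f} per H100-hour")
```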
2) Data: sources & preprocessing costs
Sources (who used what)
- Meta Llama-3 / 3.1: pretraining on “publicly available” data (web/code, multi-lingual); 15.6T tokens for 3.1
[Meta-Blog], [Llama-3.1 TR].
No explicit paid text licenses are disclosed; we assume $0 license fees for the base corpus.
- DeepSeek V2/V3 (for triangulation): multi-source corpora of 8.1T and 14.8T tokens respectively
[DeepSeek-V2], [DeepSeek-V3].
Preprocessing compute (tokenization, dedup, quality filtering)
As a proxy for Llama-scale pipelines:
the FineWeb effort reportedly used ~80k H100-hours for pipeline runs
[Le Monde];
a Gopher-style citation-filtering step alone clocked ~6,282 H100-hours
[GopherCite].
Applying 80k–100k H100-hours at cloud rates gives $0.31M–$0.39M (AWS Capacity Blocks) up to $0.8M–$1.0M (at $10/H100-h).
- Cost: $0.31M–$1.0M at $3.933–$10 per H100-h.
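The preprocessing bound in dollars, under the assumption that 80k–100k H100-hours is a reasonable proxy range for a Llama-scale pipeline (it is not a Meta disclosure):

```python
PREP_HOURS = (80_000, 100_000)                               # proxy range [Le Monde], [GopherCite]
RATES = {"AWS Capacity Blocks": 3.933, "mid scenario": 10.0}  # $/H100-hour

for label, rate in RATES.items():
    low, high = (h * rate / 1e6 for h in PREP_HOURS)
    print(f"{label}: ${low:.2f}M-${high:.2f}M")
# AWS Capacity Blocks: $0.31M-$0.39M
# mid scenario: $0.80M-$1.00M
```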
3) Training recipe (as described by the Llama-3/3.1 papers)
- Architecture: dense decoder-only Transformers with grouped-query attention (GQA), RMSNorm, SwiGLU, and rotary positional embeddings with YaRN scaling for long context
[Llama-3.1 TR].
- Context schedule: most tokens at 8k, then long-context extension to 128k with ~0.8T tokens
[Llama-3.1 TR].
- Optimization: AdamW; cosine decay LR; global batch sizes on the order of millions of tokens; stability emphasis (no loss spikes)
[Llama-3.1 TR].
- Infra setup: H100 clusters with NVLink + RoCE/IB fabrics; two 24k-GPU pods used for Llama-3
[Meta-Eng], [NVIDIA-blog].
- Ctx length: 8k → 128k, with ~0.8T long-ctx tokens.
- Stability: no irrecoverable loss spikes.
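For readers who prefer a structured view, here is the same recipe condensed into an illustrative Python dict; the field names are ours, the values are only those stated above, and details the papers specify but this section does not (exact learning rates, batch-size schedule) are intentionally omitted.

```python
# Illustrative summary of the recipe described above; not a runnable training config.
LLAMA3_RECIPE_SKETCH = {
    "architecture": "dense decoder-only Transformer",
    "attention": "grouped-query attention (GQA)",
    "norm": "RMSNorm",
    "ffn_activation": "SwiGLU",
    "positions": "rotary embeddings, scaled for long context",
    "context_schedule": {"pretrain_ctx": 8_192, "long_ctx": 131_072, "long_ctx_tokens": 0.8e12},
    "optimizer": "AdamW with cosine LR decay",
    "pretrain_tokens": 15.6e12,
    "infra": "two 24k-H100 pods, NVLink + RoCE/IB fabrics",
}
```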
4) Experimental runs before “the” run (overhead)
State-of-the-art runs are preceded by extensive isoFLOP sweeps and ablations; Chinchilla-style work reports hundreds of models in scaling studies
[Hoffmann et al.], [Databricks].
We apply a conservative +30% GPU-hour overhead to the training total.
5) Power draw & electricity
Measured peak for an 8×H100 HGX node during LLM training ≈ 8.4 kW
[Latif et al. 2024].
A 24,576-GPU job spans 3,072 nodes ⇒ ~25.8 MW IT power.
Over 54 days that’s ~33,443 MWh IT-energy; with PUE 1.5 facility energy ≈ 50,164 MWh.
At $0.06–$0.12/kWh ⇒ $3.0M–$6.0M electricity for the 405B run; the full herd (405B+70B+8B) lands about $4.8M at $0.08/kWh.
- At $0.08/kWh: ≈ $4.0M for the 405B run; ≈ $4.8M for the full herd.
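The electricity arithmetic above, step by step; node draw, PUE, and the $/kWh band come from the cited sources, while scaling the herd figure by total GPU-hours is this article's simplification.

```python
NODE_KW = 8.4                    # measured 8x H100 HGX node draw during LLM training [Latif et al.]
NODES = 24_576 // 8              # 3,072 nodes for the 405B job
DAYS = 54
PUE = 1.5                        # industry-average facility overhead [Uptime]

it_mwh = NODE_KW * NODES * DAYS * 24 / 1000        # ~33,443 MWh at the IT load
facility_mwh = it_mwh * PUE                        # ~50,164 MWh including overhead

for price in (0.06, 0.08, 0.12):                   # $/kWh band [EIA]
    print(f"405B run @ ${price:.2f}/kWh: ${facility_mwh * 1000 * price / 1e6:.1f}M")

herd_scale = 38.0 / 31.85                          # herd GPU-hours vs. the 405B run alone
print(f"Herd @ $0.08/kWh: ${facility_mwh * 1000 * 0.08 * herd_scale / 1e6:.1f}M")
```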
6) How likely are these estimates correct?
- Ground truth anchor: The 31.85M GPU-hours for Llama-3.1-405B are directly implied by Meta’s own 24k-H100 cluster and the reported ~54-day run
[Meta-Eng], [DCD]. Confidence: High.
- Scale-out to 70B/8B: We scale by the utilization factor implied by the 405B run rather than rely on a fragile MFU guess
[PaLM-Bench]. Confidence: Medium.
- Price per H100-hour: We use primary AWS Capacity Blocks (low) and round numbers ($10/$15) as mid/high scenarios. Alternative trackers show volatile prices
[IEEE Index]. Confidence: Medium.
- Data prep costs: Proxying from FineWeb/GopherCite is imperfect but scale-appropriate
[Le Monde], [GopherCite]. Confidence: Medium.
- Electricity: Based on measured node draw and industry PUE averages
[Latif et al.], [Uptime]. Power pricing varies by region; our $/kWh band is conservative
[EIA]. Confidence: Medium.
- Data licenses: Meta states “publicly available” sources for Llama-3; we therefore set license fees to $0. If paid text licenses were used, add accordingly
[Meta-Blog]. Confidence: High for the assumption as stated.
- Most certain: the 405B runtime, implying 31.85M H100-h.
- Squishiest: the $/H100-h rate, which is market dependent.
7) Bottom line (training ≠ free)
| Bucket | Low | Mid | High |
| --- | --- | --- | --- |
| GPU rental — main training (~38.0M H100-h) | $149M | $380M | $570M |
| Experiments (+30%) | $45M | $114M | $171M |
| Data preprocessing | $0.31M | $0.65M | $1.0M |
| Electricity (herd) | $3.0M | $4.8M | $6.0M |
| Total (opex view) | $197M | $499M | $748M |
Capex context: owning a single 24,576-H100 cluster is ≈ $0.5–$0.6B for GPUs alone
[Reuters], commonly ~$1B+ fully built out (assumption). Per-run capex allocation then depends on depreciation & fleet utilization.
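The roll-up below reproduces the table from the earlier building blocks; the $10/$15 rates and the +30% experiments overhead are scenario assumptions stated above, not measured quantities, and the rounding mirrors the table.

```python
HERD_MH = 38.0                                         # million H100-hours (herd total)
RATES = {"low": 3.933, "mid": 10.0, "high": 15.0}      # $/H100-hour scenarios
DATA_PREP_M = {"low": 0.31, "mid": 0.65, "high": 1.0}  # $M, from Section 2
ELEC_M = {"low": 3.0, "mid": 4.8, "high": 6.0}         # $M, from Section 5

for s, rate in RATES.items():
    rental_m = round(HERD_MH * rate)                   # GPU rental, $M
    exp_m = round(0.30 * HERD_MH * rate)               # +30% experiments overhead (Section 4)
    total_m = rental_m + exp_m + DATA_PREP_M[s] + ELEC_M[s]
    print(f"{s}: rental ${rental_m}M  experiments ${exp_m}M  total ${round(total_m)}M")
# low: rental $149M  experiments $45M  total $197M
# mid: rental $380M  experiments $114M  total $499M
# high: rental $570M  experiments $171M  total $748M
```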
- Electricity share: training spend is dominated by GPU time, not power.
- Data prep: at Llama scale, prep compute is modest.
References (primary where possible)