TL;DR — the 2026 cost envelope
Compute costs for fine-tuning have fallen sharply for two years in a row. The new bottleneck is dataset quality and evaluation, not GPU cost. A production-grade LoRA programme on an open-weights model now lands between USD 30,000 (7B–13B, narrow task) and USD 180,000 (70B instruction adaptation) end-to-end. Full fine-tunes on 70B+ models still routinely exceed USD 250,000 once you include the dataset, eval harness, MLOps, and the first six months of maintenance.
| Programme | Compute only | End-to-end (with data + eval + ops) |
|---|---|---|
| LoRA 7B-13B, narrow task | USD 200–1,500 | USD 30–80k |
| LoRA 70B, instruction adapt | USD 1,500–6,000 | USD 60–180k |
| Full FT 7B-13B | USD 1,500–15,000 | USD 60–200k |
| Full FT 70B | USD 25–90k | USD 180–450k |
| Continued pre-train, 70B, 50B tokens | USD 180–420k | USD 400k–1.2M |
GPU-hour pricing across H100, H200, B200, A100, MI300X
GPU pricing in 2026 is unrecognisable compared to the 2023 panic-buy era. Three forces collapsed prices: H100 supply finally catching demand in H2 2025, B200/GB200 entering general availability in Q1 2026, and the rise of neoclouds (CoreWeave, Lambda, RunPod, Crusoe, FluidStack, Vast.ai) running at materially lower margins than hyperscalers.
| GPU | Hyperscaler on-demand (per GPU-hour) | Neocloud on-demand (per GPU-hour) | Neocloud spot (per GPU-hour) |
|---|---|---|---|
| A100 80GB | USD 2.20–3.20 | USD 1.20–1.80 | USD 0.80–1.40 |
| H100 80GB SXM | USD 2.80–4.20 | USD 1.80–2.60 | USD 1.20–1.80 |
| H200 141GB | USD 3.50–5.00 | USD 2.40–3.40 | USD 1.80–2.40 |
| B200 / GB200 (early access) | USD 5.50–8.00 | USD 4.00–6.00 | limited |
| MI300X | USD 2.90–4.00 | USD 1.90–2.80 | USD 1.30–1.90 |
Two pricing dynamics deserve calling out. First, B200 looks expensive on paper but delivers roughly 2.0–2.5x the throughput of H100 on FP8 training and 3–4x on FP4 inference. At the upper end of that range, the per-token cost of a 70B fine-tune is lower on B200 than on H100 despite the higher hourly rate. Second, MI300X with ROCm 6.2+ has reached production parity for Llama, Mistral, Qwen and Gemma fine-tuning; if your team can live with the slightly thinner ecosystem, you save 10–25%.
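Whether a higher hourly rate pays off depends on the speedup you actually realise. The sketch below is a rough check, assuming cost per token scales as hourly price divided by relative throughput and using the midpoints of the neocloud on-demand ranges from the table; the function name and the midpoint choice are illustrative assumptions.

```python
def relative_cost_per_token(hourly_new, hourly_base, speedup):
    """Cost per token on the new GPU divided by cost per token on the baseline.

    A value below 1.0 means the new GPU is cheaper per token despite its
    higher hourly rate. Assumes throughput is the only thing that changes.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (hourly_new / hourly_base) / speedup


# Midpoints of the neocloud on-demand ranges in the table (USD per GPU-hour).
H100_HOURLY = (1.80 + 2.60) / 2
B200_HOURLY = (4.00 + 6.00) / 2

for speedup in (2.0, 2.25, 2.5):
    ratio = relative_cost_per_token(B200_HOURLY, H100_HOURLY, speedup)
    print(f"speedup {speedup:.2f}x -> B200 cost per token = {ratio:.2f}x H100")
```

At these midpoints the break-even speedup is simply the price ratio, so it is worth measuring throughput on your own workload before committing to either SKU.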
LoRA, QLoRA, DPO, full fine-tune — cost per method
Five methods cover 95% of 2026 fine-tuning work. Pick by the shape of the problem, not by what your team most recently read about.
- Supervised fine-tuning (SFT) with LoRA / QLoRA. Train low-rank adapters (rank 8–64) on top of frozen base weights. 0.1–3% of parameters updated. QLoRA adds 4-bit base-model quantisation, slashing VRAM by ~4x. Cost: 1–5% of full SFT. Default choice.
- Full SFT. Update all parameters. Required when you change tokenizer, vocabulary, or do continued pre-training. 20–50x more VRAM than LoRA — you need ZeRO-3 / FSDP across multiple nodes for anything above 13B.
- Direct Preference Optimisation (DPO) and variants (IPO, KTO, ORPO). Aligns the model against preference pairs without a separate reward model. Cost: 1.5–3x SFT on the same dataset. Required when style, safety, or refusal behaviour matters.
- Continued pre-training. Tens to hundreds of billions of new tokens of domain corpus. Cost dominated by data acquisition (USD 50–500k for a clean specialist corpus) and compute (USD 100–500k for 50B tokens on a 70B model).
- Reinforcement learning from verifiable rewards (RLVR), GRPO, RLHF. 2026's hot direction for reasoning models. Cost 3–8x SFT for comparable wall-clock; the eval and reward-model infrastructure dominates total spend.
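To see where the "0.1–3% of parameters updated" figure for LoRA comes from, the sketch below counts adapter parameters. A rank-r adapter on a d_in x d_out weight adds two small matrices with r x (d_in + d_out) parameters in total. The model shape used here (hidden size 4096, 32 layers, adapters on four square attention projections, 7e9 total parameters) is a hypothetical 7B-class configuration, not a specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one rank-r adapter: A is d_in x r, B is r x d_out."""
    return rank * (d_in + d_out)


def lora_fraction(hidden, n_layers, adapted_per_layer, rank, total_params):
    """Fraction of all model parameters that the adapters train."""
    per_matrix = lora_trainable_params(hidden, hidden, rank)
    return per_matrix * adapted_per_layer * n_layers / total_params


# Hypothetical 7B-class shape; real models often have non-square K/V projections.
for rank in (8, 16, 64):
    frac = lora_fraction(hidden=4096, n_layers=32, adapted_per_layer=4,
                         rank=rank, total_params=7e9)
    print(f"rank {rank:>2}: {frac:.3%} of parameters trainable")
```

Adapting the MLP projections as well raises the fraction, which is how configurations reach the upper end of the range.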
Dataset curation: the largest line item nobody budgets
In every audit we run on a stalled fine-tuning programme, the dataset is the gating issue. The internal estimate at the start is invariably 5–10x too low. A realistic 2026 cost stack for a 30,000-pair high-quality instruction dataset in a regulated domain:
| Activity | Cost range | Notes |
|---|---|---|
| Sourcing and rights clearance | USD 2–15k | Counsel review, licensing of third-party corpora, CDSM Article 4(3) opt-out checks for EU. |
| PII / PHI redaction pipeline | USD 3–8k | Presidio + custom regex + LLM-assisted review; mandatory for HIPAA, GDPR Article 5 data minimisation. |
| Annotation labour (SME) | USD 6–25k | USD 20–120/hour depending on domain; legal, medical, finance at the top. |
| Synthetic data generation | USD 1–6k | Claude Opus or GPT-4o calls plus verification; cost falls quickly when verification runs on Sonnet/Haiku. |
| Inter-annotator agreement and adjudication | USD 1–4k | 10–20% double-labeled, third-party adjudication on disagreements. |
| Dataset eval and decontamination | USD 1–3k | n-gram overlap against held-out eval, MinHash near-duplicates, contamination against MMLU/HumanEval/etc. |
Total for a serious 30k-pair dataset: USD 14–61k. For 100k+ pairs in a regulated domain, expect USD 40–180k. This is why we tell clients during fine-tuning engagements that the dataset budget should be at least 3–6x the compute budget, not the other way around.
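The decontamination row in the table refers to n-gram overlap against a held-out eval set. A minimal sketch of that check follows, assuming whitespace tokenisation and an 8-word window; both are illustrative choices, and the function names are hypothetical. Near-duplicate detection with MinHash would be a separate pass.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(train_texts, eval_texts, n=8):
    """Indices of training examples sharing at least one n-gram with the eval set."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [i for i, text in enumerate(train_texts)
            if ngrams(text, n) & eval_grams]


train = ["the quick brown fox jumps", "completely different sentence here today"]
held_out = ["a quick brown fox appears"]
print(contaminated(train, held_out, n=3))  # index 0 shares "quick brown fox"
```

Flagged examples are usually dropped from training rather than from the eval set, so the eval stays comparable across runs.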
Evaluation infrastructure: don't ship blind
The fastest way to lose money on fine-tuning is to ship a model whose quality you cannot measure. Eval infrastructure for a serious programme:
- Frozen test set — 500–2,000 examples, never seen in training, versioned, hashed in CI.
- Production traffic replay set — 1,000–5,000 anonymised real prompts, refreshed monthly.
- Bias slices — per-group performance to satisfy EU AI Act Article 10(2)(f) and GDPR Article 22 explanations.
- LLM-as-judge harness — Claude or GPT-4-class judge with hand-validated rubrics; correlation against human judges measured quarterly.
- Public benchmarks where relevant — MMLU-Pro, MATH, HumanEval+, IFEval, MT-Bench v2, plus a domain-specific benchmark you build once and reuse.
Setup cost: USD 3–15k. Per-eval cost on a serious harness: USD 200–1,000 in LLM-judge calls. Budget USD 800–3,000/month for continuous eval against production traffic.
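The "versioned, hashed in CI" requirement for the frozen test set can be sketched as a fingerprint check. This is one possible approach, assuming the test set is a list of JSON-serialisable records and that the pinned hash lives in the repository; `dataset_fingerprint` and `assert_frozen` are hypothetical helper names.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """SHA-256 over a canonical JSON encoding of each record, in order."""
    digest = hashlib.sha256()
    for example in examples:
        # sort_keys makes the hash independent of dict key order.
        line = json.dumps(example, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assert_frozen(examples, pinned_hash):
    """Raise if the test set no longer matches the hash committed to the repo."""
    actual = dataset_fingerprint(examples)
    if actual != pinned_hash:
        raise ValueError(f"test set changed: expected {pinned_hash}, got {actual}")
```

A CI job would load the test set, call `assert_frozen` with the committed hash, and fail the build on any edit, reorder, or accidental regeneration.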
Worked examples: 7B, 13B, 70B end-to-end budgets
Three real programmes we ran in 2025–2026, with numbers cleaned of client specifics:
Example A — LoRA on Qwen2.5-7B for legal document extraction
- Dataset: 14,000 hand-labeled extraction pairs from contract corpus. Annotation by paralegals at USD 45/hour blended. Dataset cost: USD 38,000.
- Compute: 8xH100 spot for 6 hours per training run, 14 runs across hyperparameter sweep + DPO pass. USD 1,150.
- Eval harness: USD 6,200 setup, USD 1,800/month ongoing.
- MLOps and engineering: 6 weeks senior engineer at USD 180/hr blended. USD 43,200.
- Total programme: USD 88,550. Replaced a USD 22k/month GPT-4o pipeline; break-even in month 5.
Example B — QLoRA on Llama-3.3-70B for customer-support voice
- Dataset: 22,000 historical support tickets with curated agent responses; synthetic augmentation 3x. Cost: USD 26,000.
- Compute: 4xH200 on neocloud for 9 hours per run, 8 runs. USD 1,400.
- Eval + ops: USD 9,800 setup, USD 2,200/month ongoing.
- Engineering: 8 weeks. USD 57,600.
- Total: USD 94,800. Reduced average handle time by 31%; payback in 4 months on labour savings alone.
Example C — Full FT on Mistral-Small-22B for clinical scribe
- Dataset: 48,000 de-identified clinical dictation pairs; HIPAA-controlled pipeline. Cost: USD 142,000.
- Compute: 32xH100 FSDP, 18 hours per run, 5 runs. USD 13,500.
- Eval (medical SME-graded) and compliance: USD 31,000.
- Engineering, MLOps, HIPAA review: USD 118,000.
- Total: USD 304,500. Frontier API was not an option (BAA-blocked in this configuration); the fine-tune is the product.
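The three totals can be re-derived from the stated line items. The sketch below assumes the totals cover one-off costs only (monthly eval spend excluded), which is how the published figures add up; the dictionary keys are illustrative labels.

```python
# One-off line items per example, in USD, as stated above.
EXAMPLES = {
    "A": {"dataset": 38_000, "compute": 1_150, "eval_setup": 6_200, "engineering": 43_200},
    "B": {"dataset": 26_000, "compute": 1_400, "eval_setup": 9_800, "engineering": 57_600},
    "C": {"dataset": 142_000, "compute": 13_500, "eval_compliance": 31_000, "engineering": 118_000},
}

# (GPUs, hours per run, number of runs) as stated in each example.
RUNS = {"A": (8, 6, 14), "B": (4, 9, 8), "C": (32, 18, 5)}


def programme_total(items):
    """Sum of all one-off line items."""
    return sum(items.values())


def gpu_hours(gpus, hours_per_run, runs):
    """Total GPU-hours consumed across all training runs."""
    return gpus * hours_per_run * runs


for name, items in EXAMPLES.items():
    total = programme_total(items)
    hours = gpu_hours(*RUNS[name])
    share = items["compute"] / total
    print(f"Example {name}: total USD {total:,}, {hours:,} GPU-hours, "
          f"compute = {share:.1%} of spend")
```

In all three cases compute is a single-digit percentage of the programme, which is the pattern the rest of this article argues for budgeting around.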
Inference economics and the break-even against frontier APIs
Over a model's life, inference cost dwarfs the cost of training the fine-tune. Run the math early.
A 13B fine-tune served on a 2xH100 vLLM instance at 80% utilisation delivers roughly 12–20 million output tokens/day at a cost of USD 95–150/day. That is USD 0.005–0.012 per 1k output tokens, against USD 0.60–15.00 per 1k for frontier APIs — a 50–1500x advantage at scale. A 70B fine-tune on 4xH100 lands at USD 0.02–0.06 per 1k tokens.
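The self-hosted per-token figure follows directly from daily cost and daily throughput. A minimal sketch of that conversion, using the 13B numbers above; the function name is an assumption. API price lists are often quoted per million tokens, so convert both sides to the same unit before comparing.

```python
def cost_per_1k_tokens(daily_cost_usd, tokens_per_day):
    """USD per 1,000 output tokens for a self-hosted deployment."""
    if tokens_per_day <= 0:
        raise ValueError("tokens_per_day must be positive")
    return daily_cost_usd / (tokens_per_day / 1_000)


# Best case: lowest daily cost at highest throughput. Worst case: the reverse.
best = cost_per_1k_tokens(95, 20_000_000)
worst = cost_per_1k_tokens(150, 12_000_000)
print(f"USD {best:.4f} to {worst:.4f} per 1k output tokens")
print(f"USD {best * 1_000:.2f} to {worst * 1_000:.2f} per 1M output tokens")
```

Throughput is the sensitive input: the same hardware bill at half the utilisation doubles the per-token cost.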
Break-even rule of thumb: a USD 80–120k fine-tune programme pays back inside 3–6 months once you exceed USD 25,000/month in frontier-API inference. Below USD 5,000/month, prompting a frontier model wins on TCO; do not fine-tune.
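The rule of thumb is a simple payback calculation. The sketch below assumes the fine-tune displaces the full frontier-API bill and treats serving and eval as an optional monthly run cost; `monthly_run_cost` and its USD 5,000 example value are hypothetical, not figures from this article.

```python
import math


def payback_months(programme_cost, monthly_api_spend, monthly_run_cost=0.0):
    """Months until cumulative net savings cover the one-off programme cost."""
    net_saving = monthly_api_spend - monthly_run_cost
    if net_saving <= 0:
        return math.inf  # the fine-tune never pays for itself
    return programme_cost / net_saving


for cost in (80_000, 120_000):
    gross = payback_months(cost, 25_000)
    net = payback_months(cost, 25_000, monthly_run_cost=5_000)
    print(f"USD {cost:,}: {gross:.1f} months gross, {net:.1f} months net of run cost")
```

Including the run cost matters most near the USD 5,000/month threshold, where net savings can drop to zero.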
Ongoing maintenance and drift
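A fine-tuned model is not a finished product. Plan USD 8–25k per quarter for a typical adapter-based programme; a full fine-tune re-train or a base-model migration in the same quarter pushes the total higher: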
- Re-evaluation against frozen and refreshed test sets — USD 1–3k.
- Drift monitoring on production traffic (embedding-distance, semantic similarity, refusal-rate, hallucination-rate) — USD 1–3k.
- Incremental dataset growth and re-labeling on hard cases — USD 3–10k.
- One re-train cycle per quarter — USD 2–30k depending on method.
- Base model migration when better open weights drop (2–3x per year in 2025–2026) — one-time USD 8–40k.
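The embedding-distance signal in the drift-monitoring item can be sketched as the cosine distance between the centroid of a reference window and the centroid of current traffic. This is one simple formulation among several; it assumes you already have prompt embeddings as lists of floats, and any alert threshold you pick is a tuning decision.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return 1.0 - dot / norm


def drift_score(reference_embeddings, current_embeddings):
    """Cosine distance between the two traffic centroids."""
    return cosine_distance(centroid(reference_embeddings),
                           centroid(current_embeddings))


reference = [[1.0, 0.0], [0.9, 0.1]]
current = [[0.2, 0.8], [0.1, 0.9]]
print(f"drift score: {drift_score(reference, current):.3f}")
```

A centroid comparison misses shifts that preserve the mean, so it is best paired with the refusal-rate and hallucination-rate signals listed above.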
Compliance overhead: GDPR, EU AI Act Article 53, SOC 2
Fine-tuning interacts with three compliance frameworks more than people expect:
- GDPR. Article 5 data-minimisation, Article 25 privacy by design, Article 28 processor agreements with annotation vendors, Article 32 security of processing, Article 35 DPIA for high-risk processing. PII in training data is a strict no — redact or synthesise.
- EU AI Act Article 53. If you fine-tune an open-weights model and redistribute it, you may be treated as a GPAI provider for the modified model. In that case the obligations include Annex XI technical documentation, Annex XII downstream-provider information, a copyright policy honouring CDSM Article 4(3) opt-out, and a public training-data summary on the AI Office template. We covered the detail in our EU AI Act SaaS checklist.
- SOC 2 / ISO 27001:2022. Annex A.5.34 (privacy and protection of PII), A.8.10 (information deletion), A.8.11 (data masking), A.8.28 (secure coding) all apply to your training pipeline; auditors are catching up fast.
For HIPAA-bound work, the BAA chain (you → cloud → GPU provider) must hold all the way down. AWS, GCP and Azure offer BAA on H100/H200 SKUs; most neoclouds do not. That cost premium is real and unavoidable for PHI fine-tunes.
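The "redact or synthesise" rule for PII starts with a pattern pass over the raw text. The sketch below is deliberately coarse, assuming only emails and phone-like digit runs need masking; the patterns and placeholder tokens are illustrative, will produce false positives on other long digit sequences, and are no substitute for a dedicated tool plus human review.

```python
import re

# Illustrative patterns only; real pipelines need names, addresses, IDs, and more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace emails and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 010-9999."))
```

Running the redaction before annotation, rather than after, keeps raw identifiers out of vendor tooling as well as out of the training set.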
Top 10 cost mistakes we see in client audits
- Defaulting to full fine-tune when LoRA would do — 10–30x compute waste.
- Hyperparameter sweeps with no early-stopping — 3–6x sweep cost.
- Running on-demand hyperscaler when spot or neocloud was fine — 2–4x compute cost.
- No eval harness — ship and pray, then re-train from scratch when it underperforms.
- Annotation labour booked to "engineering" budget, never tracked as data cost.
- No contamination check against public benchmarks — inflated eval scores, real-world failure.
- Training set leaks PII / PHI; counsel forces re-do.
- No frozen test set; eval scores drift as test set drifts.
- Choosing a base model going EOL in 6 weeks — re-train forced.
- No inference cost model before training starts — "we fine-tuned a 70B and now serving costs 4x the API we replaced".
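The early-stopping mistake in the list above is cheap to avoid. A minimal patience-based sketch follows, assuming you record one validation loss per evaluation step; the default `patience` and `min_delta` values are illustrative.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True when the last `patience` evaluations did not beat the earlier best.

    An improvement only counts if it lowers the loss by more than `min_delta`.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best >= best_before - min_delta


history = [1.00, 0.80, 0.70, 0.71, 0.72, 0.73]
print(should_stop(history))  # loss has not improved for three evaluations
```

In a sweep, calling this after each evaluation and terminating the run frees the GPU for the next configuration.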
If you are weighing a fine-tuning programme against frontier APIs or RAG, our LLM fine-tuning & MLOps team runs a fixed-price two-week feasibility — dataset audit, method recommendation, GPU-hour estimate, ROI model, EU AI Act delta. For broader AI architecture decisions across SaaS development and custom software contexts, a fractional CTO with shipped MLOps experience usually pays for itself in the first month.
FAQ
How much does it cost to fine-tune an open-weights LLM in 2026?
LoRA on a 7B–13B model: USD 200–1,500 in compute; USD 30–80k end-to-end. LoRA on 70B: USD 1,500–6,000 in compute; USD 60–180k end-to-end. Full fine-tunes cost roughly 5–15x more in compute.
LoRA vs full fine-tuning?
Default to LoRA / QLoRA. Matches full-FT quality in 85–95% of cases at 1–5% of compute and storage. Full FT only when changing tokenizer/vocabulary or doing continued pre-training.
What is the going GPU-hour price in 2026?
H100 80GB: USD 1.20–1.80 on neocloud spot, USD 1.80–2.60 on neocloud on-demand. H200: USD 2.40–3.40 on neocloud on-demand. B200: USD 4.00–6.00 on neocloud, offset by 2.0–2.5x the FP8 training throughput of H100. A100 at USD 0.80–1.40 spot is still cost-optimal for small LoRA jobs.
How big a dataset do I actually need?
LoRA instruction-tuning: 1k–10k high-quality pairs beats 100k noisy. Domain Q&A: 5k–30k real conversations. Classification/extraction: 2k–10k per class with strong inter-annotator agreement.
When does ROI justify a fine-tune?
Under USD 5k/month API spend — don't fine-tune. USD 5k–25k — only if narrow. Above USD 25k/month, or where latency or data residency forces it — almost always yes.
What does ongoing maintenance cost?
USD 8–25k per quarter: re-eval, drift monitoring, incremental data, one re-train. Teams that skip maintenance lose 4–9 percentage points of quality per quarter.
Build the dataset like it's the product. The model is the artefact.
The single highest-leverage change we make in fine-tuning audits is reallocating budget from compute to data. Spend 60–70% of programme dollars on dataset curation, eval, and labelling; spend 5–15% on compute; spend the rest on MLOps. Teams that flip this ratio ship models that miss; teams that respect it ship models that compound.
Last updated 26 May 2026. Prices reflect publicly observable on-demand and spot pricing across major hyperscalers and neoclouds as of May 2026 and may move sharply. Nothing in this article constitutes legal or investment advice.


