The two-line answer
If your knowledge changes faster than once a quarter, use RAG. If you need a specific output format, latency under 200 ms, or domain-specific reasoning the base model cannot reliably produce, fine-tune. Serious products use both: a fine-tuned small open model as the reasoning engine, with RAG supplying fresh, citation-grade facts. This is the boring, correct answer in 2026 and the one we ship 9 out of 10 times for clients of our RAG-as-a-Service practice.
The decision frame in 2026
The argument between RAG and fine-tuning has been muddied by a year of marketing. Long-context models (Claude 4.6 Sonnet at 1M tokens, Gemini 2.5 Pro at 2M) led some teams to claim "RAG is dead." It is not, and it will not be. Long context shifts the boundary; it does not erase it. The frame we use to choose:
- How fresh must the answer be? If facts change between training cuts and inference, you cannot fine-tune them in. RAG or tool use is the only honest answer.
- How big is the corpus? Below ~50,000 tokens of stable knowledge, in-context prompting is fine. Above that, RAG begins to pay.
- How specific is the output? If you need strict JSON, a custom DSL, a tone, or domain reasoning chains, fine-tuning earns its keep.
- What are your unit economics? Below 5 M tokens/day, hosted closed models win. Above 50 M tokens/day on a stable workload, a fine-tuned 8–13B model on your own GPU is dramatically cheaper.
- What latency do you need? A self-hosted fine-tuned 8B model on H200 runs at 80–120 tokens/s for a single user with first-token latency under 150 ms. Claude 4.6 Sonnet through API sits at 600–900 ms first-token from EU.
RAG: what it actually is in 2026 and why it still dominates
Retrieval-augmented generation in 2026 is not the naive "embed, search, stuff into prompt" pipeline of 2023. A production RAG system in 2026 has at least six components, each with a real engineering decision behind it:
| Component | 2026 defaults | Why it matters |
|---|---|---|
| Ingestion / chunking | LlamaIndex / Unstructured / Haystack pipelines, semantic chunking at 400–800 tokens | 90% of RAG failures are chunking failures |
| Embeddings | Voyage-3, OpenAI text-embedding-3-large, BGE-M3 (open) | Voyage-3 leads MTEB leaderboard; BGE-M3 is the best open option |
| Vector store | Qdrant, Weaviate, pgvector, Pinecone serverless | Under 100M vectors, pgvector on the same Postgres you already run is hard to beat |
| Hybrid retrieval | BM25 + dense + metadata filter, fused via RRF | Pure dense retrieval still loses to hybrid on enterprise corpora |
| Re-ranking | Cohere Rerank 3, BGE-reranker-v2, Voyage Rerank-2 | Adds 50–80 ms but increases top-3 precision by 15–30 pp |
| Generation | Claude 4.6 Sonnet, GPT-4o, Gemini 2.5 Pro, or a self-hosted fine-tune | Pick by latency and cost, not "best benchmark" |
What changed in the last 18 months: structured retrieval. Pure semantic search over chunks loses to multi-stage pipelines that combine BM25, dense retrieval, metadata filters, and a re-ranker. We see precision@5 jump from 0.62 (naive dense) to 0.88 (hybrid + rerank) on the same corpus, and that translates directly into fewer hallucinated answers downstream.
Fine-tuning: what it actually means in 2026
Fine-tuning in 2026 splits cleanly into two camps:
- Closed-model adapter tuning. OpenAI offers fine-tuning on GPT-4o and o3-mini; Google offers tuning on Gemini 2.5 Flash; Anthropic offers fine-tuning of Claude 4.6 Haiku for AWS Bedrock customers. You upload a JSONL of examples, pay per training token, and consume via the same API.
- Open-weight fine-tuning. LoRA or QLoRA on Llama 4 (8B, 70B, 405B), Mistral Large 3, Mixtral 8×22B, Qwen 3, or DeepSeek V3. You own the weights, you control the inference, and unit cost drops dramatically at scale.
What fine-tuning is good at: format, style, domain vocabulary, and reasoning chains the base model has seen but cannot reliably reproduce. Llama 4 8B fine-tuned on 30,000 examples of your medical-coding workflow will beat Claude 4.6 Sonnet zero-shot on that workflow, while running at 3% of the cost.
What fine-tuning is bad at: teaching new facts. Despite a decade of papers, parametric knowledge insertion via fine-tuning remains unreliable. Models memorise some facts, generalise others poorly, and confabulate on edges. If you fine-tune to "teach" the model your product catalogue, you will spend three months chasing edge cases that a RAG pipeline solves in a week.
Benchmarks that matter when comparing
Public benchmarks have become a poor proxy for production performance, but a few still help when comparing base models you intend to fine-tune:
- MMLU and MMLU-Pro: general knowledge breadth. Claude 4.6 Opus and GPT-4o both sit above 90; Llama 4 70B around 84; Mistral Large 3 around 82.
- GPQA Diamond: graduate-level reasoning. o3 leads at ~88; Claude 4.6 Opus ~85; Gemini 2.5 Pro ~83.
- SWE-bench Verified: real-world software engineering. Claude 4.6 Sonnet leads at ~72%; o3 ~70%; Gemini 2.5 Pro ~65%.
- HumanEval+, LiveCodeBench: coding under contamination control.
- Your own eval set. Always. No public benchmark predicts performance on your data.
Cost: real 2026 numbers
Here is what we actually pay in May 2026, per 1M tokens, for the most common production models:
| Model | Input / 1M | Output / 1M | Context |
|---|---|---|---|
| Claude 4.6 Opus | $15 | $75 | 1M |
| Claude 4.6 Sonnet | $3 | $15 | 1M |
| Claude 4.6 Haiku | $0.80 | $4 | 200k |
| GPT-4o | $2.50 | $10 | 128k |
| o3 | $10 | $40 | 200k |
| Gemini 2.5 Pro | $1.25 | $5 | 2M |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M |
| Llama 4 70B (self-hosted, 8×H200) | ~$0.40 | ~$0.60 | 128k |
| Llama 4 8B fine-tuned (1×H200) | ~$0.10 | ~$0.15 | 128k |
| DeepSeek V3 (API) | $0.27 | $1.10 | 128k |
Fine-tuning costs in 2026, what we actually pay:
- Llama 4 8B LoRA on 50,000 examples: $200–600 per run on a rented H200 (8–24 hours at $3–5/hour).
- Llama 4 70B LoRA on 50,000 examples: $1,500–4,000 on 4×H200 over 18–36 hours.
- Llama 4 70B full fine-tune: $4,000–12,000 on 8×H200.
- GPT-4o fine-tuning: ~$25/1M training tokens via OpenAI API.
- Gemini 2.5 Flash tuning: ~$8–12/1M training tokens.
Add 30–50% for eval set construction and 2–3 iterations to converge.
The hybrid pattern most production stacks use
For mid-to-large enterprise deployments, our default architecture is:
- Generator: a LoRA-tuned Llama 4 8B or Mistral 7B-derivative, trained on 20–80k examples of the customer's domain reasoning and output format. Hosted on a single H200 or split with vLLM for throughput.
- Retriever: hybrid pgvector + BM25, with metadata filters and Cohere Rerank 3.
- Router: a tiny Claude 4.6 Haiku call decides whether to answer from prior context, hit retrieval, or escalate to a stronger model.
- Escalation: Claude 4.6 Sonnet or o3 for the 5–10% of queries that need deeper reasoning.
- Glue: DSPy for prompt optimisation, MCP servers for clean tool boundaries, Anthropic SDK for the escalation client.
This typically lands at $0.30–$0.80 per 1,000 user interactions all-in, vs. $1.50–$4.00 for a pure Claude 4.6 Sonnet pipeline doing the same work — and gives you a model you actually own.
Reference stack we ship
- Ingestion: LlamaIndex + Unstructured (PDFs, DOCX, slides, scanned forms), Haystack for pipeline orchestration when graph processing is heavy.
- Vector DB: pgvector (under 100M vectors), Qdrant (above 100M or multi-tenant), Weaviate where graph + vector matters.
- Embeddings: Voyage-3 (closed, leader on MTEB) or BGE-M3 (open).
- Re-ranker: Cohere Rerank 3 (API) or BGE-reranker-v2-m3 (self-host).
- Prompt + program layer: DSPy for optimisable programs; LangChain still acceptable but increasingly replaced.
- Agent surface: MCP servers expose retrieval, tools and data sources cleanly to one or many LLM clients.
- Eval: Ragas, TruLens, plus our own held-out gold set per client.
- Observability: Langfuse, Helicone, Datadog LLM Observability.
If you want a reference build, see our GenAI Integration service and the parallel AI/ML & Data Engineering page.
Five expensive mistakes
- Fine-tuning to "teach" facts. It does not work reliably. Use RAG.
- Skipping evaluation. If you cannot measure correctness on a held-out set, you cannot improve. Build the eval before the model.
- Going straight to a frontier model when you need throughput. Claude 4.6 Opus on a high-volume internal workload burns money you could spend on engineers. Start with Haiku or Gemini Flash, escalate only when accuracy demands.
- Naive chunking. Fixed-size chunks slice tables and code. Use semantic + structural chunking. Test with real documents from day one.
- Ignoring the EU AI Act. If you deploy in the EU, your RAG and fine-tuning pipelines have new traceability obligations from August 2026. We cover this in detail under EU AI Act compliance.
FAQ
Is RAG always cheaper than fine-tuning?
For changing knowledge, yes. Above ~50M tokens/day on stable workloads, a fine-tuned 8–13B open model is cheaper per inference.
Does long context kill RAG?
No. Stuffing 1M tokens per request costs roughly $3 on Claude 4.6 Sonnet and adds 30–90 s of latency. RAG keeps cost and latency low.
When does fine-tuning win outright?
Specific output formats, domain reasoning the base model cannot reliably emit, or latency under 200 ms at high throughput.
What's the 2026 default stack?
LlamaIndex + pgvector or Qdrant + Cohere Rerank 3 + Claude 4.6 Sonnet, with DSPy for prompt optimisation and MCP for tool boundaries.
How much does a Llama 4 fine-tune cost?
$200–600 for an 8B LoRA on 50k examples; $4–12k for a 70B full fine-tune on 8×H200.
Can RAG and fine-tuning be combined?
Yes, and it is the production default for serious products: fine-tuned reasoning + retrieved facts.
Last updated 26 May 2026. Pricing and benchmarks reflect provider rate cards and public leaderboards as of May 2026.


