Daniel Reyes, Principal Engineer (AI/ML), YuSMP Group · LLM systems, RAG and fine-tuning for production

The 60-second answer

By mid-2026 the enterprise AI agent stack has converged. The defaults that work:

  • Model: Claude 4.6 Sonnet as the workhorse, Opus/o3 for hard steps, Gemini 2.5 Pro for long context, DeepSeek V3 or Llama 4 for cost-sensitive bulk.
  • Orchestration: LangGraph for agentic state machines, LlamaIndex when retrieval is central, DSPy for prompt/pipeline optimisation. Anthropic SDK with tool use for the simplest cases.
  • Integration: MCP servers for every tool, reused across agents and clients.
  • Evals: Braintrust, Langfuse or Phoenix. 50–500 golden tasks. Run on every change.
  • Observability: OpenTelemetry traces with token cost, tool calls and latency per step.
  • Compliance: EU AI Act Article 4 (AI literacy) for everyone; full Article 9 risk management for use cases classified as high-risk under Article 6.

Model landscape and pricing in 2026

The model picture in mid-2026 is more stable than it has been since 2023. Six families lead (Anthropic, OpenAI, Google, Mistral, Meta and DeepSeek), each with a clear job to do.

Model               In / Out per 1M tokens (USD)   SWE-bench Verified   Best for
Claude 4.6 Opus     $15 / $75                      ~74%                 Hardest planning, complex code, long-horizon agents
Claude 4.6 Sonnet   $3 / $15                       ~70%                 Default workhorse for tool-using agents
OpenAI o3           $10 / $40                      ~71%                 Multi-step reasoning, maths, structured planning
GPT-4o              $2.50 / $10                    ~55%                 Multimodal, voice, fast responses
Gemini 2.5 Pro      $1.25 / $5                     ~63%                 2M-token context, bulk doc analysis
Mistral Large 3     $2 / $6                        ~52%                 EU residency, multilingual
Llama 4 (Bedrock)   $0.90 / $2.70                  ~50%                 Self-hostable, fine-tunable
DeepSeek V3         $0.27 / $1.10                  ~49%                 Cost-sensitive bulk, RAG generation
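
To make the price column concrete, here is a minimal sketch of per-request cost arithmetic. The prices are copied from the table; the model keys and the token counts are illustrative assumptions, not real API identifiers or measured workloads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, as listed in the table above."""
    input_per_m: float
    output_per_m: float


# Snapshot of the table; the keys are hypothetical labels, not API model ids.
PRICES = {
    "claude-4.6-opus": Price(15.00, 75.00),
    "claude-4.6-sonnet": Price(3.00, 15.00),
    "gemini-2.5-pro": Price(1.25, 5.00),
    "deepseek-v3": Price(0.27, 1.10),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, ignoring caching and batch discounts."""
    price = PRICES[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


# Hypothetical agent step: 8,000 tokens of context in, 1,000 tokens out.
for name in PRICES:
    print(f"{name:18s} ${request_cost(name, 8_000, 1_000):.4f}")
```

Multiplying the per-step figure by steps per task and tasks per seat gives a first-order monthly estimate before any routing or caching is applied.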

MMLU is now saturated above 88% across all frontier models, so it is no longer a useful discriminator. GPQA Diamond (graduate-level science reasoning) and SWE-bench Verified (real GitHub issues) are the 2026 benchmarks that actually predict agent performance.

Orchestration layer — LangGraph, LlamaIndex, DSPy

Each of the major orchestration frameworks now has a clearly best home:

  • LangGraph (LangChain). Stateful, graph-based agent execution. The default for tool-using agents with branching control flow. Built-in checkpointing, time-travel debugging, human-in-the-loop.
  • LlamaIndex. The right choice when retrieval is the central concern — document Q&A, structured-data RAG, knowledge-base agents. Excellent ingestion connectors, mature reranking, native MCP support since v0.12.
  • DSPy (Stanford). Programmatic prompt and pipeline optimisation. You write the structure, DSPy learns the prompts via metrics. Best for narrow pipelines where you can define an objective function.
  • Anthropic SDK direct. For the simplest tool-use agents (one or two tools, no branching), going framework-free with the Anthropic SDK’s tool-use loop is faster and easier to reason about than any framework.

Production stacks frequently combine: LangGraph as the outer state machine, LlamaIndex as the retrieval substrate, DSPy to optimise a critical sub-prompt against a metric. The frameworks are interoperable.
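
The idea of a stateful, graph-based agent with checkpointing can be sketched in plain Python. This is deliberately not the LangGraph API: the node names, the stubbed "model" logic and the `run` helper are all assumptions made for illustration.

```python
import copy
from typing import Callable

State = dict  # shared state passed from node to node


def plan(state: State) -> str:
    # Stub for an LLM planning call: decide whether retrieval is needed.
    state["needs_docs"] = "policy" in state["question"].lower()
    return "retrieve" if state["needs_docs"] else "answer"


def retrieve(state: State) -> str:
    # Stub for the retrieval step (the role LlamaIndex would play).
    state["docs"] = ["<retrieved passage>"]
    return "answer"


def answer(state: State) -> str:
    # Stub for the final LLM call.
    state["answer"] = f"Answer using {len(state.get('docs', []))} document(s)"
    return "END"


NODES: dict[str, Callable[[State], str]] = {
    "plan": plan,
    "retrieve": retrieve,
    "answer": answer,
}


def run(state: State, start: str = "plan", max_steps: int = 10):
    """Walk the graph, snapshotting state after every node."""
    checkpoints = []
    node = start
    for _ in range(max_steps):
        if node == "END":
            break
        next_node = NODES[node](state)
        checkpoints.append((node, copy.deepcopy(state)))
        node = next_node
    if node != "END":
        raise RuntimeError(f"step budget exhausted at node {node!r}")
    return state, checkpoints


final_state, checkpoints = run({"question": "What is the travel policy?"})
```

Each node returns the name of the next node, which is what makes branching explicit, and the per-node snapshots are what a framework builds replay and human-in-the-loop pauses on.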

MCP — the integration standard that actually stuck

Anthropic’s Model Context Protocol shipped in late 2024 and by mid-2026 has effectively become the default integration standard for agent tooling. The major IDE-agent clients (Claude Desktop, Cursor, Continue, Windsurf, JetBrains AI), the major frameworks (LangChain, LlamaIndex, Mastra) and the major hosted-agent platforms all speak it.

For enterprise this is significant. The 2024 reality of writing custom tool integrations per agent and per client has collapsed into: build one MCP server per system (Jira, ServiceNow, Salesforce, SharePoint, Confluence, S3, Snowflake), expose its tools and resources, and every compliant client can use it.

Enterprise MCP servers we ship most often:

  • Internal knowledge bases (Confluence, Notion, SharePoint) with row-level auth.
  • Ticketing / project management (Jira, Linear, ServiceNow) with audit logging.
  • CRM (Salesforce, HubSpot) with field-level access control.
  • Data warehouse (Snowflake, BigQuery, Databricks) with parameterised query templates.
  • Internal microservices via OpenAPI → MCP adapters.
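
MCP messages are JSON-RPC 2.0, and tools are exposed through the `tools/list` and `tools/call` methods. The sketch below handles only those two methods to show the shape of the exchange. It omits the initialize handshake, transports, auth and notifications, and the `search_tickets` tool is hypothetical. A real server would normally be built on an official MCP SDK rather than hand-rolled like this.

```python
import json


def search_tickets(query: str) -> str:
    # Stub: a real server would call the ticketing API with the caller's credentials.
    return f"0 tickets match {query!r}"


# Hypothetical tool registry for an internal ticketing system.
TOOLS = {
    "search_tickets": {
        "handler": search_tickets,
        "description": "Search tickets by free-text query.",
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
}


def _reply(req_id, key: str, payload: dict) -> str:
    return json.dumps({"jsonrpc": "2.0", "id": req_id, key: payload})


def handle(message: str) -> str:
    """Dispatch one JSON-RPC request and return the JSON-RPC response."""
    req = json.loads(message)
    method, params = req["method"], req.get("params", {})
    if method == "tools/list":
        tools = [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]
        return _reply(req["id"], "result", {"tools": tools})
    if method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return _reply(req["id"], "error", {"code": -32602, "message": "Unknown tool"})
        text = tool["handler"](**params.get("arguments", {}))
        return _reply(req["id"], "result",
                      {"content": [{"type": "text", "text": text}], "isError": False})
    return _reply(req["id"], "error", {"code": -32601, "message": "Method not found"})
```

Because every client speaks the same two methods, the per-client glue code disappears and only the handlers behind the registry are system-specific.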

Five enterprise agent patterns that ship

  1. Knowledge-worker copilot. Embedded in the user’s tool of choice (Slack, Teams, IDE, web app). RAG over corporate docs + a few tools. Cost: $4–18 per seat/month at scale. The highest-volume use case in 2026.
  2. Customer-support deflection agent. Frontline agent for tier-1 tickets that hands off to humans when it signals uncertainty. Saves 30–55% of tier-1 volume in our deployments. Critical: confidence threshold + clean handoff, not full automation.
  3. Sales-research agent. Pre-meeting brief, account research, CRM enrichment. Read-only by design. Saves 40–90 minutes per AE per day in mid-market sales orgs.
  4. Engineering agent (code review, ticket triage, PR drafting). Cursor + custom MCP servers + GitHub Actions. Real productivity gains; aggressive evals required.
  5. Operations agent. Internal IT, HR onboarding, procurement triage. Highest ROI per seat because it replaces ticketing-system theatre with conversations.
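
The "confidence threshold + clean handoff" rule from the support-deflection pattern can be sketched as a small routing function. The threshold value, the `Draft` type and the source of the confidence score are assumptions here. In practice the score would need calibrating against labelled tickets.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    reply: str
    confidence: float  # 0.0-1.0, e.g. from a calibrated classifier or judge model


HANDOFF_THRESHOLD = 0.8  # hypothetical; tune on labelled tickets


def route_ticket(draft: Draft, threshold: float = HANDOFF_THRESHOLD) -> dict:
    """Auto-reply only above the threshold; otherwise hand off with context."""
    if draft.confidence >= threshold:
        return {"action": "auto_reply", "reply": draft.reply}
    # Clean handoff: pass the draft and the reason so the human does not start from zero.
    return {
        "action": "handoff",
        "suggested_reply": draft.reply,
        "reason": f"confidence {draft.confidence:.2f} below {threshold:.2f}",
    }
```

Passing the draft and the reason along with the handoff is what keeps the human step short, so the agent still saves time on tickets it does not close itself.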

What does not ship reliably yet: fully autonomous “run my business for me” agents, long-horizon planning past ~30 steps without human checkpoints, and any high-stakes decision (hiring, credit, medical) without human-in-the-loop.

Evals — the discipline most teams skip

The single largest reason enterprise agent projects fail in 2026 is the absence of a real eval discipline. Without evals, every prompt change is a guess, every model swap is a regression risk, and every customer complaint requires forensic reconstruction.

The minimum eval discipline for production:

  1. Build a golden set of 50–500 representative tasks. Each has an input, an expected output (or a pass/fail rubric), and a category tag.
  2. Run the eval on every prompt change, every model change, every tool change. Block deploys on regression.
  3. Track three numbers in CI: task success rate, average steps per success, cost per success. All three should be stable or improving.
  4. Use a tool: Braintrust (commercial, best dev experience), Langfuse (open source, EU-friendly), Phoenix (Arize, open source) or Anthropic’s built-in evals API.
  5. Add LLM-as-judge for outputs that are not directly verifiable. Use a different model as judge (Claude 4.6 Sonnet judging GPT-4o output is common). Calibrate judge against 30 human-labelled outputs.
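
The steps above can be sketched as a tiny golden-set runner with a regression gate. Everything here is illustrative: the `Task` and `RunResult` types, the exact-match check, the 2% tolerance and the stub agent are assumptions, and a real setup would delegate to one of the eval tools named above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    input: str
    expected: str
    category: str


@dataclass
class RunResult:
    output: str
    steps: int
    cost_usd: float


def evaluate(tasks: list[Task], agent: Callable[[str], RunResult]) -> dict:
    """Compute success rate, average steps per success and cost per success."""
    successes, steps, cost = 0, 0, 0.0
    for task in tasks:
        result = agent(task.input)
        cost += result.cost_usd  # failed runs still cost money
        if result.output.strip() == task.expected:  # exact match; swap in a rubric as needed
            successes += 1
            steps += result.steps
    return {
        "success_rate": successes / len(tasks),
        "steps_per_success": steps / successes if successes else float("inf"),
        "cost_per_success": cost / successes if successes else float("inf"),
    }


def regressed(baseline: dict, current: dict, tolerance: float = 0.02) -> bool:
    """True if any of the three tracked numbers got worse beyond the tolerance."""
    return (
        current["success_rate"] < baseline["success_rate"] - tolerance
        or current["steps_per_success"] > baseline["steps_per_success"] * (1 + tolerance)
        or current["cost_per_success"] > baseline["cost_per_success"] * (1 + tolerance)
    )


def stub_agent(text: str) -> RunResult:
    # Stand-in for the real agent; answers one task correctly.
    return RunResult("4", 3, 0.01) if text == "2+2" else RunResult("Lyon", 5, 0.02)


golden = [Task("2+2", "4", "math"), Task("Capital of France?", "Paris", "geo")]
metrics = evaluate(golden, stub_agent)
```

In CI, a non-zero exit when `regressed(baseline, metrics)` is true is enough to block the deploy. Note that cost per success divides total spend by successful runs, so failures push the number up.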

Cost engineering — routing, caching, distillation

Inference cost is the new COGS. The four 2026 techniques that dominate cost engineering:

  1. Routing. Per-step model selection. Cheap, capable model (Sonnet) for the bulk, expensive reasoning model (Opus, o3) for hard steps. Typical savings: 40–70%.
  2. Prompt caching. Anthropic bills cached input tokens at a 90% discount; OpenAI at 50%. For workloads with stable system prompts and document context, caching saves 30–55% of input cost.
  3. Distillation. Log Claude/GPT outputs, fine-tune a small open model (Llama 4 8B, Mistral 7B) on narrow tasks. 5–15× cheaper inference at 92–97% quality retention. See LLM fine-tuning & MLOps.
  4. Structured output. Constrained decoding (JSON schema, regex, BAML) reduces token spend and retry rates simultaneously.

Combined, these techniques routinely bring a $25–80 per-seat-per-month bill down to $4–18 without measurable quality loss.
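
As a worked example of how routing and caching compound, the sketch below prices one hypothetical 10-step task three ways. The Sonnet and Opus prices come from the pricing table and the 90% cache discount from the list above. The workload shape (6,000 input and 500 output tokens per step, 5,000 of the input tokens being a stable cached prefix, 2 hard steps out of 10) is an assumption, and any cache-write surcharge is ignored.

```python
SONNET = (3.00, 15.00)  # USD per 1M input / output tokens
OPUS = (15.00, 75.00)


def step_cost(price, input_tokens, output_tokens, cached_tokens=0, cache_discount=0.0):
    """Cost in USD of one model call, with part of the input served from cache."""
    in_price, out_price = price
    fresh_tokens = input_tokens - cached_tokens
    cached_cost = cached_tokens * in_price * (1 - cache_discount)
    return (fresh_tokens * in_price + cached_cost + output_tokens * out_price) / 1_000_000


def task_cost(hard_steps, easy_steps, **step_kwargs):
    """Hard steps go to Opus, easy steps to Sonnet."""
    return (hard_steps * step_cost(OPUS, **step_kwargs)
            + easy_steps * step_cost(SONNET, **step_kwargs))


workload = {"input_tokens": 6_000, "output_tokens": 500}

all_opus = task_cost(10, 0, **workload)                      # no routing
routed = task_cost(2, 8, **workload)                         # routing only
routed_cached = task_cost(2, 8, cached_tokens=5_000,
                          cache_discount=0.90, **workload)   # routing + caching

for label, usd in [("all Opus", all_opus), ("routed", routed),
                   ("routed + cached", routed_cached)]:
    print(f"{label:16s} ${usd:.3f} per task")
```

Under these assumptions routing alone cuts the per-task cost by about 64%, which sits inside the 40–70% range quoted above. The caching gain depends heavily on how much of each prompt is a stable prefix.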

Observability for agents

Agents are non-deterministic state machines making tool calls. Logs alone do not cut it. The minimum:

  • OpenTelemetry traces with one span per LLM call and per tool call. Attributes: model, input tokens, output tokens, cost, latency, success.
  • Per-agent dashboards: success rate, p95 latency, p95 cost, top failing tools, top noisy prompts.
  • A “session replay” UI — for any agent run, see the full transcript, tool calls and intermediate state. Indispensable for debugging.
  • Cost budgets per tenant, per agent and per user with hard caps.

Langfuse, Helicone, Honeycomb and Datadog APM all now support agent-aware tracing.
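
The span-per-call idea and the p95 dashboard numbers can be illustrated with a plain-Python stand-in. This is not the OpenTelemetry API: the `span` helper, the in-memory `SPANS` list and the attribute names are assumptions, and a production system would emit real spans through the OpenTelemetry SDK.

```python
import time
from contextlib import contextmanager
from statistics import quantiles

SPANS: list[dict] = []  # in-memory sink; a real exporter would ship these out


@contextmanager
def span(name: str, **attributes):
    """Record one LLM or tool call with latency and a success flag."""
    record = {"name": name, "success": True, **attributes}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["success"] = False
        raise
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20)[-1]


# Usage: wrap a (stubbed) model call and attach token and cost attributes.
with span("llm.call", model="claude-4.6-sonnet") as s:
    s["input_tokens"], s["output_tokens"] = 1_200, 150
    s["cost_usd"] = (1_200 * 3.00 + 150 * 15.00) / 1_000_000
```

Recording the failure flag in the `finally` path means a tool call that raises still shows up in the trace, which is exactly the case a session replay needs.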

Security, prompt injection and data exfiltration

Prompt injection is a real attack surface, not a hypothetical. The 2026 threat model assumes any external content the agent reads (emails, web pages, docs) may contain injected instructions trying to exfiltrate data via tools the agent has access to.

Defences that work:

  • Least-privilege tools. Read-only by default. Write capabilities require explicit human confirmation.
  • Tool allow-lists per context. An agent reading user email should not have access to file-write or external-HTTP tools.
  • Output filtering. Apply an egress filter to tool outputs before they return to the model; block obvious exfiltration patterns.
  • Human-in-the-loop on high-impact actions. Anything irreversible (delete, send, transfer) requires confirmation.
  • Per-tenant isolation. No cross-tenant data in retrieval. RLS at the data layer.
  • Audit logs. Every tool call recorded with input, output, user, agent, model.
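
Several of these defences (per-context allow-lists, confirmation on irreversible actions, audit logging) can live in one gate in front of every tool call. The sketch below is a minimal illustration. The context names, tool names and the `ToolGate` class are hypothetical, and the `confirmed` flag stands in for a real human approval flow.

```python
from dataclasses import dataclass, field

# Hypothetical contexts and tool names, for illustration only.
ALLOWED_TOOLS = {
    "email_triage": {"read_email", "search_kb"},
    "it_ops": {"search_kb", "create_ticket", "delete_ticket"},
}
IRREVERSIBLE = {"delete_ticket", "send_email", "transfer_funds"}


@dataclass
class ToolGate:
    context: str
    user: str
    audit_log: list = field(default_factory=list)

    def call(self, tool: str, args: dict, handler, confirmed: bool = False):
        """Run the tool only if policy allows it; log every attempt."""
        if tool not in ALLOWED_TOOLS.get(self.context, set()):
            decision = "denied: not on allow-list"
        elif tool in IRREVERSIBLE and not confirmed:
            decision = "pending: human confirmation required"
        else:
            decision = "allowed"
        output = handler(**args) if decision == "allowed" else None
        self.audit_log.append({
            "tool": tool, "input": args, "output": output,
            "user": self.user, "context": self.context, "decision": decision,
        })
        return decision, output
```

Denied and pending attempts are logged as well as successful ones, since a blocked call to an unexpected tool is often the first visible sign of an injection attempt.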

EU AI Act — what enterprise actually has to do

The EU AI Act’s general obligations apply from August 2026, with high-risk system obligations applying from August 2027. By mid-2026, enterprise should already have:

  • AI literacy programme (Art. 4). Documented training for staff using AI systems. Applies to almost every company in the EU.
  • Inventory of AI systems. Internal and procured. Classification per Act risk levels.
  • For high-risk uses (recruitment, credit, education, critical infrastructure, law enforcement, biometric identification): conformity assessment, risk management system, data governance, technical documentation, human oversight, accuracy/robustness/cybersecurity controls, post-market monitoring.
  • For GPAI integration (calling Claude, GPT, Gemini): downstream-deployer obligations, copyright disclosure, training-data summary access.

For most enterprise productivity agents the obligations are documentation, logging and the literacy programme. For genuinely high-risk uses, plan a 6–12 week compliance workstream. See EU AI Act compliance.

Reference architecture

A pragmatic 2026 enterprise agent reference stack:

  • Client surface: Slack/Teams bot, web app, IDE plugin, or REST API.
  • API gateway: Auth via WorkOS or Clerk; per-tenant rate limit; OpenTelemetry context propagation.
  • Orchestrator: LangGraph state machine (Python or TypeScript). Per-step model routing.
  • Tool layer: MCP servers per system. One server per integration, reused across agents.
  • Retrieval layer: LlamaIndex over Postgres (pgvector), Qdrant, or Weaviate. Per-tenant scoping enforced at the index level.
  • Model layer: Anthropic API for Claude, OpenAI for GPT-4o/o3, Google Vertex for Gemini, Bedrock for Llama 4, hosted Mistral for EU residency. Router lives in the orchestrator.
  • Eval pipeline: Braintrust or Langfuse on every change in CI.
  • Observability: OpenTelemetry → Datadog or Grafana Cloud. Langfuse for agent-level traces.
  • Cost budgets: Per-tenant + per-user, with hard caps enforced in the orchestrator.
  • Compliance: Audit log to S3 + Glacier; documented data flows; AI literacy programme; DPIA for high-risk.
The stack converged. The discipline (evals, observability, cost engineering, EU AI Act) is what now separates a demo from a deployment.

FAQ

Which model should I use for enterprise AI agents in 2026?

Claude 4.6 Sonnet as the workhorse, Opus/o3 for hard steps, Gemini 2.5 Pro for long context, DeepSeek/Llama 4 for cost-sensitive bulk. Route per step.

What is MCP and do I need it?

Anthropic’s Model Context Protocol, the integration standard that stuck. One MCP server per system reused across agents and clients. Default for new work.

LangChain, LlamaIndex or DSPy?

LangGraph for agentic state, LlamaIndex for retrieval-heavy, DSPy for optimised pipelines. Often combined.

How much does an enterprise AI agent cost to run?

$4–18 per seat per month with routing + caching; $25–80 without.

What does the EU AI Act mean for enterprise agents?

AI literacy + logging for most uses; full conformity assessment for high-risk. Plan 6–12 weeks for high-risk compliance.

How do I evaluate an agent before shipping?

50–500 golden tasks. Run on every change. Track success rate, steps per success, cost per success. Ship only when stable across two runs.

Last updated 26 May 2026.