"Which model should we use?" It's one of the most common questions I get, and the most honestly answered by: it depends on what you're actually doing. The uncomfortable follow-up is that most people asking the question are comparing models on benchmark scores, which are about as useful for production decisions as judging a car on its top speed.

MMLU, HumanEval, MATH — these benchmarks are valid scientific measurements. They're just measuring something different from what you need when you're deploying an agent that processes supplier invoices at 3am, or a support bot that needs to handle an angry customer asking about a billing dispute in three languages, or an extraction pipeline that needs to pull specific fields from 40 different PDF formats reliably.

This article is about what we've found from actual production deployments. We compare five models across two tiers: the everyday production workhorses — GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Flash — used at volume in document pipelines, support agents, and extraction systems; and the frontier-tier heavyweights — GPT-5.2 and Claude Opus 4.6 — for use cases that genuinely need maximum capability. The observations come from projects where we've had the data to compare: accuracy on domain-specific tasks, per-call cost at volume, latency distributions under realistic load, and failure modes on edge-case inputs.

Why benchmark scores mislead production decisions

Benchmarks test the model in isolation on clean, standard problems. Production systems have noise. Here are the differences that reliably matter more than overall benchmark scores:

  • Instruction following consistency. How reliably does the model return structured output (JSON, specific field names, constrained formatting) across thousands of calls? A model that gets output format right 97% of the time vs 99.5% of the time is the difference between 30 errors per 1,000 calls and 5 — which has a massive downstream impact on a pipeline that needs to parse the output.
  • Graceful degradation on edge cases. What happens when the input is ambiguous, poorly formatted, or genuinely unclear? Does the model flag uncertainty, make a best-effort attempt, or confabulate confidently? Confident wrong answers are worse than honest "I'm not sure" responses in most production contexts.
  • Cost at your volume. At 10,000 calls/day, a $1 per million token price difference isn't academic. Context window usage, output length control, and whether you can batch effectively all matter to the actual cost.
  • Latency under load. Average latency is almost useless. You want p95 and p99. The 99th percentile user in your support workflow is waiting how long?
  • API reliability. All major providers have had outages. They differ significantly in how quickly they resolve incidents, what their rate limits are at various pricing tiers, and whether there’s a fallback path (Azure OpenAI, Bedrock, Vertex AI, etc.) when the primary endpoint is degraded.
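Two of the dimensions above — format compliance and tail latency — are directly measurable from your own call logs, and measuring them is the first step before any model comparison. A minimal sketch, assuming a hypothetical invoice schema and fabricated log values for illustration:

```python
import json
import math

REQUIRED_FIELDS = {"invoice_id", "total", "currency"}  # hypothetical schema

def is_schema_compliant(raw_output: str) -> bool:
    """True if the model output parses as JSON with every required field present."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys()

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: rough, but enough for a latency dashboard."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[idx]

# Replay a (fabricated) slice of logged calls:
outputs = [
    '{"invoice_id": "A1", "total": 9.5, "currency": "GBP"}',
    'Sure! Here is the JSON you asked for...',  # chatty preamble = parse failure
]
latencies = [0.8, 1.1, 1.3, 2.9, 6.4]  # seconds per call

compliance = sum(map(is_schema_compliant, outputs)) / len(outputs)
print(f"schema compliance: {compliance:.1%}")        # 50.0%
print(f"p95 latency: {percentile(latencies, 95)}s")  # 6.4s
```

Run something like this per model on the same traffic sample and the 97% vs 99.5% distinction from the first bullet stops being abstract.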

The comparison: at a glance

| Dimension | GPT-4.1 | Claude Sonnet 4 | Gemini 2.5 Flash | GPT-5.2 | Claude Opus 4.6 |
|---|---|---|---|---|---|
| Structured output / JSON | Excellent (native Structured Output mode) | Very good (strong format adherence) | Good (reliable with schema) | Excellent (native Structured Output) | Excellent (best instruction compliance) |
| Document understanding | Excellent (vision + text, 1M context) | Excellent (best long-doc reasoning) | Very good (1M token context + thinking) | Excellent (superior overall reasoning) | Best in class (1M context beta, deepest reasoning) |
| Instruction following | Very reliable (improved vs 4o) | Best in class (enhanced steerability) | Strong (notable improvement vs 2.0) | Excellent (most capable) | Best in class (frontier-tier precision) |
| Latency (typical agent call) | 1.0–2.0s avg | 1.2–2.8s avg (standard mode) | 0.5–1.5s avg | 2.0–4.5s avg | 2.5–6.0s avg (standard mode) |
| Cost (input, per 1M tokens) | $2.00 | $3.00 | $0.30 | $1.75 | $5.00 |
| Cost (output, per 1M tokens) | $8.00 | $15.00 | $2.50 | $14.00 | $25.00 |
| Context window | 1M | 200K | 1M | 1M | 1M (beta) |
| Hybrid reasoning / thinking | No (standard generation) | Yes (extended thinking mode) | Yes (thinking budgets) | No (deep standard generation) | Yes (extended thinking mode) |
| API reliability / fallback | Azure fallback available | Bedrock & Vertex AI | GCP / Gemini API | Azure fallback available | Bedrock, Vertex AI, MS Foundry |
| Function / tool calling | Excellent (most consistent) | Excellent (parallel tool use) | Very good (native tool use) | Excellent (best overall capability) | Excellent (parallel + long-horizon agents) |

Prices are as of March 2026 and shift often — always check the current provider pricing pages for your exact use case. The first three columns are production mid-tier models; GPT-5.2 and Claude Opus 4.6 are the current frontier tier from their respective labs.

Document processing pipelines

This is the use case that shows up most often in our work: extracting structured data from PDFs, contracts, invoices, and forms. The inputs are messy — scanned documents, mixed formats, tables with irregular layouts, handwritten annotations alongside typed text.

Mid-tier production models

GPT-4.1 — $2.00 / $8.00 per 1M
Strong default for mixed format & vision tasks
Native vision + Structured Output mode is very reliable. The 1M context window means you can feed entire large PDFs without chunking. Schema compliance at volume is consistently 99%+. Best all-round starting point.
Claude Sonnet 4 — $3.00 / $15.00 per 1M
Best mid-tier for long-document reasoning
Wins on multi-page contract analysis requiring synthesis across 50+ pages. Extended thinking mode adds another level of reasoning depth for complex or ambiguous layouts. Enhanced steerability helps when you need precise control over output structure.
Gemini 2.5 Flash — $0.30 / $2.50 per 1M
Best for high-volume, cost-efficient extraction
~6–7× cheaper than GPT-4.1. Thinking budgets available for ambiguous inputs. A major upgrade from the now-deprecated Gemini 2.0 Flash. Best choice when margin is tight and schemas are well-defined.

When to step up to the frontier tier

GPT-5.2 — $1.75 / $14.00 per 1M
Best price-performance at frontier level
Cheaper on input than GPT-4.1 but significantly more capable. Worth reaching for when extraction involves highly unstructured, ambiguous, or legally sensitive content where mid-tier models produce too many edge-case failures.
Claude Opus 4.6 — $5.00 / $25.00 per 1M
Deepest reasoning for the hardest documents
The right pick for high-stakes documents — complex multi-party contracts, regulatory filings, technical due diligence — where reasoning depth and error rate matter more than cost. Extended thinking mode with 1M context (beta) is a step-change for the most demanding pipelines.

Customer support agent systems

This is multi-turn: the agent needs to understand context from conversation history, reference product documentation, handle queries in multiple languages, and decide when to escalate to a human without being trigger-happy about it.

Mid-tier production models

GPT-4.1 — $2.00 / $8.00 per 1M
Solid default for most support agents
Consistent multi-turn handling, strong parallel tool calling, and good cross-language performance. The 1M context window makes it easy to load full product documentation into the prompt without chunking. Reliable across a wide range of query types and domains.
Claude Sonnet 4 — $3.00 / $15.00 per 1M
Best for nuanced, tone-sensitive interactions
Noticeably better at calibrating tone — knowing when to be apologetic, firm, or direct. Enhanced steerability means you can control communication style precisely. Escalation decisions are nuanced. Best choice when customer satisfaction scores are a primary metric.
Gemini 2.5 Flash — $0.30 / $2.50 per 1M
Best for high-volume FAQ & scripted flows
Excellent for bounded FAQ agents where the answer space is well-defined. Its latency advantage (0.5–1.5s) is a genuine benefit for live chat. Instruction following is significantly better than Gemini 2.0 Flash. Use thinking budgets selectively on complex, multi-part queries.

When to step up to the frontier tier

GPT-5.2 — $1.75 / $14.00 per 1M
Best for high-stakes, complex support workflows
Worth the premium when support conversations touch on legally sensitive topics, complex billing disputes, or technical troubleshooting where a wrong response creates real downstream risk. GPT-5.2’s accuracy edge over GPT-4.1 is meaningful in these edge cases.
Claude Opus 4.6 — $5.00 / $25.00 per 1M
Best for enterprise & long-horizon agent tasks
When support involves multi-step agentic workflows — researching across internal knowledge bases, drafting responses, and executing follow-up actions — Opus 4.6’s sustained performance on long task chains is unmatched. Targeted use at the hardest tier of queries only.

Structured data extraction at scale

Pure extraction — pulling specific fields from text, normalising them, and outputting a clean JSON structure. The kind of thing that runs millions of times and needs high reliability and low cost. This use case is where the cost differences between models hit hardest in practice.

The cost reality across all five models at 50M tokens/month input: Gemini 2.5 Flash ~$15 · GPT-5.2 ~$87 · GPT-4.1 ~$100 · Claude Sonnet 4 ~$150 · Claude Opus 4.6 ~$250. The accuracy-to-cost tradeoff is the core decision here — and it depends entirely on your error tolerance and downstream stakes.
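Those monthly figures are just unit price times volume. A quick sketch to reproduce them (input tokens only; output tokens and any caching discounts sit on top):

```python
# Input pricing in $ per 1M tokens (the March 2026 figures quoted above).
INPUT_PRICE = {
    "gemini-2.5-flash": 0.30,
    "gpt-5.2": 1.75,
    "gpt-4.1": 2.00,
    "claude-sonnet-4": 3.00,
    "claude-opus-4.6": 5.00,
}

def monthly_input_cost(model: str, tokens_per_month: int) -> float:
    """Monthly input-token spend in dollars at the listed rate."""
    return INPUT_PRICE[model] * tokens_per_month / 1_000_000

for model in INPUT_PRICE:
    print(f"{model}: ${monthly_input_cost(model, 50_000_000):.2f}")
# gemini-2.5-flash: $15.00
# gpt-5.2: $87.50
# gpt-4.1: $100.00
# claude-sonnet-4: $150.00
# claude-opus-4.6: $250.00
```

Worth noting the GPT-5.2 figure rounds to ~$87 in the list above; the exact arithmetic gives $87.50.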

The tiered routing approach we’ve landed on for several clients: use Gemini 2.5 Flash as the first pass at scale, route low-confidence outputs (where the model flags uncertainty or downstream validation fails) to GPT-4.1 for a second pass, and reserve Claude Opus 4.6 or GPT-5.2 only for the small slice where GPT-4.1 also struggles. In practice that means paying Gemini prices on ~88% of volume, GPT-4.1 prices on ~10%, and frontier prices on ~2%. Total cost is close to Gemini cost. Accuracy approaches frontier quality. You have to build the routing logic, but it pays off at volume.
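A back-of-envelope check on the "close to Gemini cost" claim, taking the worst case where the frontier slice always lands on Opus pricing:

```python
# Blended input cost per 1M tokens under the ~88/10/2 routing split described
# above, assuming the 2% frontier slice always goes to Opus 4.6 (worst case).
TIERS = [
    ("gemini-2.5-flash", 0.88, 0.30),  # (model, share of volume, $/1M input)
    ("gpt-4.1",          0.10, 2.00),
    ("claude-opus-4.6",  0.02, 5.00),
]

blended = sum(share * price for _, share, price in TIERS)
print(f"blended input cost: ${blended:.3f}/1M tokens")
# blended input cost: $0.564/1M tokens -- vs $0.30 pure Gemini, $2.00 pure GPT-4.1
```

So even with Opus on the tail, the blend is under a third of running everything through GPT-4.1, which is what makes the routing logic worth building.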

Important note: Gemini 2.0 Flash is now deprecated by Google. If you’re running anything on 2.0 Flash in production, migrate to Gemini 2.5 Flash — it’s a meaningfully better model at a still-competitive price.

The honest recommendation

There is no universally best model for production. Here’s a clean decision framework across all five models:

  • Gemini 2.5 Flash ($0.30/$2.50 per 1M) — high volume, cost-sensitive tasks. The right pick when you’re running millions of calls and your task is well-defined enough to measure output quality systematically. Set a fallback route to GPT-4.1 for low-confidence outputs. Don’t use Gemini 2.0 Flash — it’s deprecated. For latency-critical paths (live chat, real-time tools), use standard mode without thinking budgets to preserve the speed advantage.
  • GPT-4.1 ($2.00/$8.00 per 1M) — the reliable all-rounder default. Best starting point for most production workloads. Mature tooling ecosystem (Structured Output mode, parallel function calling, 1M context, Azure fallback). Use this when you need broad capability without a specific reason to deviate.
  • Claude Sonnet 4 ($3.00/$15.00 per 1M) — when quality and tone matter more than cost. Switch from GPT-4.1 when document reasoning quality, nuanced instruction following, or customer-facing tone calibration is the bottleneck. The extended thinking mode can be enabled selectively for the hardest calls. The output cost premium over GPT-4.1 ($15 vs $8 per 1M, nearly double) is often justified by the quality improvement on complex analytical tasks.
  • GPT-5.2 ($1.75/$14.00 per 1M) — frontier capability at the lowest frontier price. Surprisingly, GPT-5.2’s input cost is lower than GPT-4.1’s. Use it when mid-tier models hit a quality ceiling on your task — complex reasoning, highly ambiguous inputs, legally sensitive content — and you need frontier-level accuracy without paying Opus prices. The output cost ($14/1M) is where the premium bites.
  • Claude Opus 4.6 ($5.00/$25.00 per 1M) — the frontier ceiling for the hardest problems. Highest cost but also the highest sustained reasoning quality available. Best for long-horizon agentic tasks, complex multi-step workflows, adversarial document review, and high-stakes enterprise automation where errors are expensive. The 1M context window (beta on the API) and extended thinking mode make it the strongest option for the most demanding production use cases.
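The framework above collapses to a first-pass routing rule you can pressure-test against your own workload. This is a hypothetical sketch — the flags and thresholds are illustrative, not derived from measurement, and your own evals should drive the final choice:

```python
def pick_model(calls_per_day: int,
               needs_frontier: bool,
               tone_sensitive: bool,
               cost_sensitive: bool) -> str:
    """Illustrative first-pass router mirroring the decision framework above.
    Thresholds are assumptions, not recommendations: tune against your evals."""
    if needs_frontier:
        # GPT-5.2 is the default frontier pick; reach for claude-opus-4.6 only
        # for long-horizon agentic work where its sustained reasoning pays off.
        return "gpt-5.2"
    if cost_sensitive and calls_per_day > 100_000:
        return "gemini-2.5-flash"  # with a GPT-4.1 fallback for low confidence
    if tone_sensitive:
        return "claude-sonnet-4"
    return "gpt-4.1"  # the reliable all-rounder default

print(pick_model(500_000, needs_frontier=False,
                 tone_sensitive=False, cost_sensitive=True))
# gemini-2.5-flash
```

The point isn't the specific branches; it's that writing your selection criteria down as executable logic forces you to state which dimensions actually matter for your system.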

The thing that matters more than model choice

I'd be doing you a disservice if I finished this article without saying: model selection is typically the fifth or sixth most important decision in a production AI system. Far ahead of it are: prompt design and how you handle context, data quality and pipeline reliability, error handling and fallback behaviour, evaluation methodology (how you actually know the system is working), and deployment infrastructure.

A well-designed system using GPT-4.1 will consistently outperform a poorly-designed system using whichever model wins the latest benchmark. The models are close enough in capability on most production tasks that system design is the differentiator. The wrong model choice costs you maybe 10–20% on a given metric. Poor system design costs you the entire thing.

That said: model costs are real, model quality differences are real, and choosing deliberately rather than defaulting to whatever is most familiar is worth the hour it takes to think it through.

Trying to decide which model fits your pipeline?

Describe your use case and we can give you an honest read on the right tool — including when the answer is something other than a frontier LLM.

Start a conversation