"Which model should we use?" It's one of the most common questions I get, and the most honestly answered by: it depends on what you're actually doing. The uncomfortable follow-up is that most people asking the question are comparing models on benchmark scores, which are about as useful for production decisions as judging a car on its top speed.
MMLU, HumanEval, MATH — these benchmarks are valid scientific measurements. They're just measuring something different from what you need when you're deploying an agent that processes supplier invoices at 3am, or a support bot that needs to handle an angry customer asking about a billing dispute in three languages, or an extraction pipeline that needs to pull specific fields from 40 different PDF formats reliably.
This article is about what we've found from actual production deployments. We compare five models across two tiers: the everyday production workhorses — GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Flash — used at volume in document pipelines, support agents, and extraction systems; and the frontier-tier heavyweights — GPT-5.2 and Claude Opus 4.6 — for use cases that genuinely need maximum capability. The observations are from projects where we've had the data to compare: accuracy on domain-specific tasks, per-call cost at volume, latency distributions under realistic load, and failure modes under edge case inputs.
Why benchmark scores mislead production decisions
Benchmarks test the model in isolation on clean, standard problems. Production systems have noise. Here are the differences that reliably matter more than overall benchmark scores:
- Instruction following consistency. How reliably does the model return structured output (JSON, specific field names, constrained formatting) across thousands of calls? A model that gets output format right 97% of the time vs 99.5% of the time is the difference between 30 errors per 1,000 calls and 5 — which has a massive downstream impact on a pipeline that needs to parse the output.
- Graceful degradation on edge cases. What happens when the input is ambiguous, poorly formatted, or genuinely unclear? Does the model flag uncertainty, make a best-effort attempt, or confabulate confidently? Confident wrong answers are worse than honest "I'm not sure" responses in most production contexts.
- Cost at your volume. At 10,000 calls/day, a $1-per-million-tokens price difference isn't academic. Context window usage, output length control, and whether you can batch effectively all feed into the actual cost.
- Latency under load. Average latency is almost useless. You want p95 and p99. The 99th percentile user in your support workflow is waiting how long?
- API reliability. All major providers have had outages. They differ significantly in how quickly they resolve incidents, what their rate limits are at various pricing tiers, and whether there’s a fallback path (Azure OpenAI, Bedrock, Vertex AI, etc.) when the primary endpoint is degraded.
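Several of these dimensions can be measured directly from your own call logs. Here is a minimal sketch, assuming you log each call as a `(raw_output, latency_seconds)` pair; the required JSON fields (`invoice_id`, `total`) are illustrative, not from any real schema:

```python
import json
import statistics

def summarise_calls(results):
    """Summarise a batch of model calls from your own logs.

    `results` is a list of (raw_output, latency_seconds) tuples.
    Counts format failures (unparseable JSON or missing required
    fields) and reports tail latency, not just the average.
    """
    parse_failures = 0
    latencies = []
    for raw, latency in results:
        latencies.append(latency)
        try:
            parsed = json.loads(raw)
            # Format adherence: the fields the pipeline parses must exist.
            if not {"invoice_id", "total"} <= set(parsed.keys()):
                parse_failures += 1
        except (json.JSONDecodeError, AttributeError):
            parse_failures += 1
    n = len(latencies)
    # quantiles(..., n=100) returns 99 cut points; index 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "errors_per_1000": round(1000 * parse_failures / n, 1),
        "p95_s": round(cuts[94], 3),
        "p99_s": round(cuts[98], 3),
    }
```

Run this per model on the same traffic sample and the "97% vs 99.5%" difference stops being abstract: it shows up directly as errors per 1,000 calls.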
The comparison: at a glance
| Dimension | GPT-4.1 (mid-tier) | Claude Sonnet 4 (mid-tier) | Gemini 2.5 Flash (mid-tier) | GPT-5.2 (frontier) | Claude Opus 4.6 (frontier) |
|---|---|---|---|---|---|
| Structured output / JSON | Excellent (native Structured Output mode) | Very good (strong format adherence) | Good (reliable with schema) | Excellent (native Structured Output) | Excellent (best instruction compliance) |
| Document understanding | Excellent (vision + text, 1M context) | Excellent (best long-doc reasoning) | Very good (1M token context + thinking) | Excellent (superior overall reasoning) | Best in class (1M context beta, deepest reasoning) |
| Instruction following | Very reliable (improved vs 4o) | Best in class (enhanced steerability) | Strong (notable improvement vs 2.0) | Excellent (most capable) | Best in class (frontier-tier precision) |
| Latency (typical agent call) | 1.0–2.0s avg | 1.2–2.8s avg (standard mode) | 0.5–1.5s avg | 2.0–4.5s avg | 2.5–6.0s avg (standard mode) |
| Cost (input, per 1M tokens) | $2.00 | $3.00 | $0.30 | $1.75 | $5.00 |
| Cost (output, per 1M tokens) | $8.00 | $15.00 | $2.50 | $14.00 | $25.00 |
| Context window | 1M | 200K | 1M | 1M | 1M (beta) |
| Hybrid reasoning / thinking | No (standard generation) | Yes (extended thinking mode) | Yes (thinking budgets) | No (deep standard generation) | Yes (extended thinking mode) |
| API reliability / fallback | Azure fallback available | Bedrock & Vertex AI | GCP / Gemini API | Azure fallback available | Bedrock, Vertex AI, MS Foundry |
| Function / tool calling | Excellent (most consistent) | Excellent (parallel tool use) | Very good (native tool use) | Excellent (best overall capability) | Excellent (parallel + long-horizon agents) |
Prices are as of March 2026 and shift often — always check the current provider pricing pages for your exact use case. GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Flash are production mid-tier models; GPT-5.2 and Claude Opus 4.6 are the current frontier tier from their respective labs.
Document processing pipelines
This is the use case that shows up most often in our work: extracting structured data from PDFs, contracts, invoices, and forms. The inputs are messy — scanned documents, mixed formats, tables with irregular layouts, handwritten annotations alongside typed text.
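What "structured extraction" means in practice is pinning the model's output to a schema and validating it at the pipeline boundary. A minimal sketch, where the invoice field names are illustrative rather than taken from any real client schema:

```python
import json
from dataclasses import dataclass

@dataclass
class InvoiceFields:
    # Illustrative schema — real pipelines define one per document type.
    supplier_name: str
    invoice_number: str
    total_amount: float
    currency: str

def parse_invoice_output(raw: str) -> InvoiceFields:
    """Validate a model's raw JSON output against the expected schema.

    Raises ValueError on any deviation, so a malformed output fails
    loudly at the boundary instead of corrupting downstream data.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    expected = set(InvoiceFields.__dataclass_fields__)
    if set(data) != expected:
        raise ValueError(f"field mismatch: got {sorted(data)}")
    return InvoiceFields(
        supplier_name=str(data["supplier_name"]),
        invoice_number=str(data["invoice_number"]),
        total_amount=float(data["total_amount"]),
        currency=str(data["currency"]),
    )
```

The point is that every model's output, regardless of tier, goes through the same validator — which is also what makes the per-model error rates comparable.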
Customer support agent systems
This is multi-turn: the agent needs to understand context from conversation history, reference product documentation, handle queries in multiple languages, and decide when to escalate to a human without being trigger-happy about it.
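The escalation decision is usually a small piece of deterministic logic sitting outside the model. A sketch of the shape it tends to take — the thresholds and topic list here are illustrative placeholders, not tuned values, and every deployment calibrates them against its own escalation data:

```python
def should_escalate(turn_count: int, model_confidence: float,
                    topic: str, sentiment: str) -> bool:
    """Decide whether to hand a support conversation to a human.

    Kept outside the model so the escalation policy is auditable
    and tunable without touching prompts.
    """
    HIGH_STAKES = {"billing_dispute", "legal", "account_closure"}
    if topic in HIGH_STAKES and sentiment == "angry":
        return True   # angry customer on a high-stakes topic: escalate now
    if model_confidence < 0.55:
        return True   # the model itself is flagging uncertainty
    if turn_count >= 6:
        return True   # long loops usually mean the bot is stuck
    return False
```

Keeping this logic out of the prompt is what prevents the "trigger-happy" failure mode: the model reports signals, and a reviewable policy makes the call.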
Structured data extraction at scale
Pure extraction — pulling specific fields from text, normalising them, and outputting a clean JSON structure. The kind of thing that runs millions of times and needs high reliability and low cost. This use case is where the cost differences between models hit hardest in practice.
The cost reality across all five models at 50M tokens/month input: Gemini 2.5 Flash ~$15 · GPT-5.2 ~$87 · GPT-4.1 ~$100 · Claude Sonnet 4 ~$150 · Claude Opus 4.6 ~$250. The accuracy-to-cost tradeoff is the core decision here — and it depends entirely on your error tolerance and downstream stakes.
The tiered routing approach we’ve landed on for several clients: use Gemini 2.5 Flash as the first pass at scale, route low-confidence outputs (where the model flags uncertainty or downstream validation fails) to GPT-4.1 for a second pass, and reserve Claude Opus 4.6 or GPT-5.2 only for the small slice where GPT-4.1 also struggles. In practice that means paying Gemini prices on ~88% of volume, GPT-4.1 prices on ~10%, and frontier prices on ~2%. Total cost is close to Gemini cost. Accuracy approaches frontier quality. You have to build the routing logic, but it pays off at volume.
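The routing logic itself is small. A sketch under stated assumptions: the three `call_*` arguments are caller-supplied functions wrapping the respective model APIs (not any provider's SDK), each returning a `(result, confidence)` pair, and the 0.85 threshold is a placeholder to be tuned against labelled data:

```python
def route_extraction(document, call_flash, call_gpt41, call_frontier,
                     confidence_threshold=0.85):
    """Tiered first-pass / escalation routing for extraction calls.

    Returns (result, model_used) so you can track how much volume
    lands on each tier and what that volume is costing you.
    """
    # First pass: cheapest model handles the bulk of traffic.
    result, confidence = call_flash(document)
    if confidence >= confidence_threshold:
        return result, "gemini-2.5-flash"
    # Low confidence: retry on the mid-tier workhorse.
    result, confidence = call_gpt41(document)
    if confidence >= confidence_threshold:
        return result, "gpt-4.1"
    # Both cheaper tiers struggled: pay frontier prices for this one document.
    result, _ = call_frontier(document)
    return result, "frontier"
```

In practice "confidence" is whatever signal you have: a self-reported uncertainty field, a validation check on the output, or agreement between two cheap passes.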
Important note: Gemini 2.0 Flash is now deprecated by Google. If you’re running anything on 2.0 Flash in production, migrate to Gemini 2.5 Flash — it’s a meaningfully better model at a still-competitive price.
The honest recommendation
There is no universally best model for production. Here’s a clean decision framework across all five models:
- Gemini 2.5 Flash ($0.30/$2.50 per 1M) — high volume, cost-sensitive tasks. The right pick when you’re running millions of calls and your task is well-defined enough to measure output quality systematically. Set a fallback route to GPT-4.1 for low-confidence outputs. Don’t use Gemini 2.0 Flash — it’s deprecated. For latency-critical paths (live chat, real-time tools), use standard mode without thinking budgets to preserve the speed advantage.
- GPT-4.1 ($2.00/$8.00 per 1M) — the reliable all-rounder default. Best starting point for most production workloads. Mature tooling ecosystem (Structured Output mode, parallel function calling, 1M context, Azure fallback). Use this when you need broad capability without a specific reason to deviate.
- Claude Sonnet 4 ($3.00/$15.00 per 1M) — when quality and tone matter more than cost. Switch from GPT-4.1 when document reasoning quality, nuanced instruction following, or customer-facing tone calibration is the bottleneck. The extended thinking mode can be enabled selectively for the hardest calls. The output cost premium over GPT-4.1 ($15 vs $8 per 1M, nearly double) is often justified by the quality improvement on complex analytical tasks.
- GPT-5.2 ($1.75/$14.00 per 1M) — frontier capability at the lowest frontier price. Surprisingly, GPT-5.2’s input cost is lower than GPT-4.1’s ($1.75 vs $2.00). Use it when mid-tier models hit a quality ceiling on your task — complex reasoning, highly ambiguous inputs, legally sensitive content — and you need frontier-level accuracy without paying Opus prices. The output cost ($14/1M) is where the premium bites.
- Claude Opus 4.6 ($5.00/$25.00 per 1M) — the frontier ceiling for the hardest problems. Highest cost but also the highest sustained reasoning quality available. Best for long-horizon agentic tasks, complex multi-step workflows, adversarial document review, and high-stakes enterprise automation where errors are expensive. The 1M context window (beta on the API) and extended thinking mode make it the strongest option for the most demanding production use cases.
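The hour of deliberate thinking mostly comes down to arithmetic. A small estimator using the prices from the comparison table above (as of March 2026; re-check current provider pricing before relying on these numbers):

```python
# Per-1M-token prices as (input, output) in dollars, from the table above.
# These shift often — treat them as a snapshot, not a source of truth.
PRICES = {
    "gemini-2.5-flash": (0.30, 2.50),
    "gpt-4.1": (2.00, 8.00),
    "claude-sonnet-4": (3.00, 15.00),
    "gpt-5.2": (1.75, 14.00),
    "claude-opus-4.6": (5.00, 25.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend in dollars for a given token volume."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price
```

Plugging in the 50M input tokens/month from the extraction example reproduces the spread quoted earlier: roughly $15 for Gemini 2.5 Flash up to $250 for Claude Opus 4.6, before output tokens are counted.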
The thing that matters more than model choice
I'd be doing you a disservice if I finished this article without saying: model selection is typically the fifth or sixth most important decision in a production AI system. Far ahead of it are: prompt design and how you handle context, data quality and pipeline reliability, error handling and fallback behaviour, evaluation methodology (how you actually know the system is working), and deployment infrastructure.
A well-designed system using GPT-4.1 will consistently outperform a poorly-designed system using whichever model wins the latest benchmark. The models are close enough in capability on most production tasks that system design is the differentiator. The wrong model choice costs you maybe 10–20% on a given metric. Poor system design costs you the entire thing.
That said: model costs are real, model quality differences are real, and choosing deliberately rather than defaulting to whatever is most familiar is worth the hour it takes to think it through.
Trying to decide which model fits your pipeline?
Describe your use case and we can give you an honest read on the right tool — including when the answer is something other than a frontier LLM.
Start a conversation