Comparison

Claude Opus 4.7 vs GPT-5 for agentic coding

APR 14, 2026 · 7 min read

I've been running both Claude Opus 4.7 and GPT-5 as the backbone of agentic coding workflows for a few months. Same harness, same repos, same tasks. This is not a benchmark post. Benchmarks are easy to fake and hard to generalize. This is what actually happened when I gave each model real work and waited.

If you're picking one for production agentic use in 2026, here's what I'd tell you over coffee.

The setup

The agents I run are fairly vanilla: a loop that plans, runs tool calls (shell, file read/write, test runner, git), and writes back to the conversation until a stop condition. I rotated the underlying model between Opus 4.7 and GPT-5 without changing anything else. Prompts were identical. Temperature was the same (low). Tool schemas were the same.
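The loop described above can be sketched roughly like this. Everything here is an illustrative stand-in — the action format, the tool names, and the stub model are hypothetical, not any vendor's actual API — but the shape (plan, execute a tool call, append the result, stop on a terminal action) is the whole harness.

```python
# Minimal sketch of a vanilla agent loop: call the model, run the tool it
# picks, append the result, repeat until a terminal "done" action.
# The model callable and tool set are stubs, not a real provider API.

def run_agent(model, tools, task, max_steps=50):
    """Drive `model` (a callable returning an action dict) over `tools`
    (a name -> callable mapping) until it emits a 'done' action."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(messages)  # e.g. {"tool": "shell", "args": {...}}
        if action["tool"] == "done":
            return action["args"]["result"]
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "name": action["tool"], "content": result})
    raise RuntimeError("agent hit step limit without finishing")

# Tiny stub model: reads one file, then declares itself done.
def stub_model(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"tool": "done", "args": {"result": "ok"}}
    return {"tool": "read_file", "args": {"path": "README.md"}}

tools = {"read_file": lambda path: f"contents of {path}"}
print(run_agent(stub_model, tools, "summarize the repo"))  # prints "ok"
```

Swapping the model behind a loop like this, while holding prompts, tools, and temperature fixed, is all the "rotation" in this post amounts to.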

Tasks ranged across four buckets I care about:

  • Multi-file refactors (5–30 files touched)
  • Long-context debugging (you give the model a failing test and 100K tokens of source)
  • Tool use (iterative, not just one-shot — the agent has to decide what to do next)
  • Test generation (write tests for existing code)

For each bucket I ran a set of real tickets from real projects. Enough to form an opinion, not enough to write a paper.

Multi-file refactors: Opus wins, not close

This was the most lopsided category. Opus 4.7 consistently produced refactors that held together. GPT-5 produced refactors where each file looked fine in isolation and the whole thing broke when you ran the tests.

The specific failure mode of GPT-5: it changes a function signature in file A, updates the callers in files B and C, and forgets about the caller in file D because it's in a different directory. Opus does the same kind of thing sometimes, but much less often, and when you point it out it fixes it without complaint.

I think this comes down to context retention across tool calls. Opus seems to actually remember what it read twelve steps ago. GPT-5 acts like it's re-reading the project every turn and losing threads.

My rough win rate on multi-file refactors: Opus 4.7 shipped a green-CI patch on first try about 70% of the time. GPT-5, about 40%. That 30-point gap is the whole argument for me.

Long-context debugging: Opus wins, but closer than you'd think

Opus has the context window advantage — 1M tokens vs GPT-5's effective ~400K for coding tasks (the nominal window is bigger but attention degrades). On paper, no contest.

In practice, most debugging tasks don't actually fill the window. You need the failing test, the file with the bug, maybe 3–5 related files, and the stack trace. That fits comfortably in either model.

Where Opus pulls ahead is when the bug is genuinely non-local. I had a case last month where a type error in a TypeScript file was caused by a schema drift five services away. Opus, fed the whole repo, traced it in one pass. GPT-5 correctly identified the proximate cause and then confidently proposed a fix that would have papered over the real bug.

If your bugs are almost always local, GPT-5 is fine. If you ship systems where the cause is often elsewhere, Opus is worth the price.

Tool use: GPT-5 is scrappier

This surprised me and I want to be honest about it.

When I wired both models into a tool-using loop, GPT-5 turned out to be the more decisive agent on a per-turn basis. It takes fewer steps to accomplish a tool-driven task, reaches for the right tool faster, and has less of a "thinking out loud" tendency between calls.

Opus is more deliberate. It will run ls, read three files, then run ls again because it wants to be sure. This is slower. It's also usually right. But slower.

On cost per task, GPT-5 ends up cheaper for pure tool-use loops because the chain is shorter. I ran a ticket that asked each model to set up a new microservice scaffold, add CI, and push. Opus: 34 tool calls, $1.18. GPT-5: 19 tool calls, $0.64. Same quality of output.
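A back-of-envelope check on that scaffold ticket makes the point sharper: the per-tool-call cost of the two runs is nearly identical, so GPT-5's savings comes almost entirely from the shorter chain, not from cheaper individual steps.

```python
# Per-call cost of the microservice-scaffold ticket above.
# Numbers are the ones reported in the post.

opus_cost, opus_calls = 1.18, 34
gpt5_cost, gpt5_calls = 0.64, 19

opus_per_call = opus_cost / opus_calls   # ≈ $0.035 per tool call
gpt5_per_call = gpt5_cost / gpt5_calls   # ≈ $0.034 per tool call

print(f"Opus:  ${opus_per_call:.3f}/call over {opus_calls} calls")
print(f"GPT-5: ${gpt5_per_call:.3f}/call over {gpt5_calls} calls")
```

In other words, if you can get Opus to take fewer steps — tighter stop conditions, more aggressive "just do it" prompting — most of the cost gap on tool-use loops should close.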

For pure agentic loops where the "code" is mostly orchestration — call this API, write this file, commit — GPT-5 is a reasonable default.

For loops where each step requires reasoning about prior steps, Opus pays back its overhead.

Test generation: Opus wins, and it's the category I care about most

Tests are where instruction-following really matters. You can ask a model to "write tests for this module" and get back 400 lines of slop that exercise every branch at a surface level and catch nothing meaningful. Both models do this by default.

The difference shows up when you give careful constraints. "Write tests that would catch the specific regression introduced by this commit. Do not test unrelated behavior. Do not mock the database. Use the existing test helpers in tests/utils/."

Opus follows this kind of prompt about 85% of the time on first try. GPT-5 follows it about 55% of the time. GPT-5 has a stronger tendency to "improve" — mocking things I told it not to, adding cases I didn't ask for, introducing new helpers instead of using existing ones.

For anyone running a real test suite in a real codebase, this is the difference between "the agent helps" and "the agent makes my test suite worse."

Tokenizers, costs, the boring but real stuff

Opus 4.7 and GPT-5 use different tokenizers, and the practical effect matters.

On a representative TypeScript file, the same content cost roughly 15% more tokens in Opus's tokenizer than in GPT-5's. This is real money over a month. Combined with Opus's higher per-token price, the gap is ~2x on equivalent work.

Prompt caching closes most of this gap if you use it, and you should. On a workflow where I fed the same 50K-token system prompt across many turns, caching dropped my effective input cost to roughly parity with GPT-5. Without caching, Opus is meaningfully more expensive.

If you're building an agentic product and haven't wired up prompt caching yet, stop reading this and go do that. It's the highest-leverage thing you can do this week.
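To see why caching is the highest-leverage change, here's a rough cost model for re-sending a large system prompt across many turns. The cached-read rate (10% of the base input price) and the per-million-token price are illustrative assumptions, not published pricing — plug in your provider's real numbers.

```python
# Rough model of input cost when the same prompt prefix is re-sent every
# turn. With caching, turn 1 pays full price and subsequent turns pay a
# discounted cached-read rate. The 10% rate and $15/MTok price below are
# hypothetical, chosen only to illustrate the shape of the savings.

def input_cost(prompt_tokens, turns, price_per_mtok, cache_read_frac=None):
    """Total input cost of re-sending `prompt_tokens` across `turns` turns.
    `cache_read_frac` is the cached-read price as a fraction of base price;
    None means no caching (full price every turn)."""
    if cache_read_frac is None:
        billed_tokens = prompt_tokens * turns
    else:
        billed_tokens = prompt_tokens * (1 + cache_read_frac * (turns - 1))
    return billed_tokens * price_per_mtok / 1_000_000

# 50K-token system prompt, 200 turns, hypothetical $15/MTok input price.
uncached = input_cost(50_000, 200, 15.0)
cached = input_cost(50_000, 200, 15.0, cache_read_frac=0.10)
print(f"uncached: ${uncached:,.2f}   cached: ${cached:,.2f}")
```

Under these assumptions the cached run costs roughly a tenth of the uncached one, which is the mechanism behind "roughly parity with GPT-5" above: the 2x per-token gap is dwarfed by the caching discount on the repeated prefix.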

Instruction following under pressure

Here's the scenario I use to stress-test new models. I write a prompt with 20 rules. Some are clearly important ("don't delete the test file"). Some are stylistic ("use const over let"). Some are quirky ("don't use Tailwind's arbitrary value syntax, we have tokens"). Then I ask for a change that makes following some of the rules inconvenient.

Opus 4.7 follows 17–19 of the rules on first pass, typically. Whatever it misses, it acknowledges when I point it out and fixes cleanly.

GPT-5 follows 13–16 on first pass. It's worse on the quirky ones — the rules that don't match internet-common patterns. It also has a subtle tendency to "reinterpret" rules rather than follow them literally. I'd rather a model ask me to clarify than guess my intent.

If your project has strong conventions that differ from open-source norms, Opus's literal-mindedness is an asset.

The recommendation

  • Multi-file refactors and long-context debugging: Opus 4.7. The win rate gap is too large to ignore.
  • Agentic tool-use loops where code is orchestration: GPT-5. Faster, cheaper, fine quality.
  • Test generation and anything instruction-heavy: Opus 4.7. Not close.
  • Mixed workload: Opus 4.7 with prompt caching. The cost gap narrows, the quality gap doesn't.

If I had to pick one model for a production agentic coding product today, it's Opus 4.7. Not because it's uniformly better — GPT-5 wins categories — but because the categories where Opus wins are the ones where failure hurts more. A cheap, fast agent that corrupts a refactor is worse than an expensive, slow agent that doesn't.

That's my bias. Your workload may disagree. Run them both on your real code for a week. Don't trust me, and especially don't trust anyone selling you benchmark results.