Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.
The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.
Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.
Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.
Just check out any conversation on dynamic vs static typing, talk to a Rust zealot, or ask a backend engineer if microservices were a mistake.
It's unfortunate, and it makes it hard to have proper discussions on these subjects. It would be worthwhile to figure out how we can have more constructive arguments.
Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.
I think that for programming, GPT's strength over Opus outweighs the smaller context window.
On this, absolutely!
I more often use Opus for planning than for implementation. In those cases I really do need the very large context window, because the agent has to read in a bunch of my code base and a bunch of previous plan files and product context and such, to understand what we're talking about.
And then I need to go back and forth with it over a really extended period: getting into a bunch of details, asking it to load how things already work so that we can discuss options for evolution of those, etc.
For that kind of thing, compaction completely destroys its effectiveness because even if you try to serialize out all the decisions made in the conversation into a plan file, the agent still loses e.g. the plan files and code files that it's read in that are adding sharp edges to its understanding of the scope of what's being planned.
For implementation or something like what you're describing in the vein of benchmarking, often I can get away with compaction. Although even then, if the agent needs to have a lot "loaded" into its head, to implement something very, very subtle, complex or far-reaching, in those cases it can be really detrimental if it compacts.
For supporting evidence, see the first chart here: https://www.anthropic.com/news/claude-fable-5-mythos-5
Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.
Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. If you look at the DeepSWE benchmark (which is probably the hardest to game at this point), GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.
I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.
There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.
None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.
For most tasks it is capable and very cheap; a day's worth of tasks costs me about $10.
And given that you can only use Composer with a Cursor monthly subscription, cost comparisons are pointless since an equivalently priced OpenAI subscription gets you just as much usage of the better model.
The other models, however, are reasonably where I’d expect them to be from experience piloting all of them. Fable is outclassing everything at most things at 10x the cost, but sometimes it isn’t a choice between cheap and expensive, but expensive and possible; I’ll need to learn where that boundary is, just as was the case with other models.
I only reach for Claude when I need to plan something big or want a sparring partner to fire off some ideas.
I think what a lot of people don't realize is that you don't need a frontier model for 80% of coding tasks. Composer 2.5 is often more than good enough, less token-hungry, and way faster.
Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?
I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their models, especially leading up to the IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token-expensive, that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.
Related: Sonnet 5’s new tokenizer increases token usage by 30%. (https://simonwillison.net/2026/Jun/30/claude-sonnet-5/)
I keep Claude around for some specific tasks:
- Linked up to Figma MCP to implement front-end stuff
- Data analysis, in the "Connect AI to a data source and ask questions" way. I've tried both Opus 4.8 high and GPT-5.5 high for this, and Opus is stronger because it gets the intent in the question better.
I used to keep it around for planning too, but the 4.8 plans have had more holes than Swiss cheese.