That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks, I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that's a fairly isolated anecdote and not necessarily indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, whereas Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off Playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really, really hard, which is naturally more limited.
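For the curious, that one-off script was presumably something in this vein; this is my own sketch, and the URL and selector are placeholders, not what Opus actually wrote:

    // Hypothetical sketch: drive the app, perform the interaction, then dump the DOM.
    import { chromium } from "playwright";

    (async () => {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto("http://localhost:3000"); // placeholder dev-server URL

      await page.click("#submit");      // placeholder for the interaction in question
      await page.waitForTimeout(500);   // let the UI settle

      const dom = await page.content(); // serialized DOM after the interaction
      console.log(dom);                 // dump it so the agent (or a human) can inspect it

      await browser.close();
    })();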
Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.
1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production-level code. The only negatives are its cutoff window and cost, especially with its love of tools. That's not usually a big deal for the Rails projects I work on, but sometimes it is.
2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.
3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied with or slightly better than Sonnet 4. Depends on the situation.
4. Sonnet 4 via Claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do its thing, but it won't be quality, concise, and thoughtful code without more specific prompting and revisions.
I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.
Like just today, it made a list of toys for my toddler that fit her developmental stage and play style. Would have taken me 1-2 hrs of browsing multiple websites otherwise
If I'm working on a complex problem and want to go back and forth on software architecture, I like having o3 research prior art and have a back and forth on trade-offs.
If o3 was faster and cheaper I'd use it a lot more.
I'm curious what your workflows are!
Then I use the /login command that opens a browser window to log into Claude Max.
You can confirm Claude Max billing going forward in VS Studio/Claude Code: /cost
"With your Claude Max subscription, no need to monitor cost — your subscription includes Claude Code usage"
However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.
Your model rankings are spot on. I’m hesitant to make the jump to top tier premium models as daily drivers, so I hang out with sonnet 4 and/or Gemini 2.5 pro for most of the day (max mode in Cursor). I don’t want to get used to premium quality coming that easy, for some reason. I completely align with the concise, thoughtful code being worth it though. I’m having to do that myself using tier 2 models. I still use o3 periodically for getting clarity of thought or troubleshooting gnarly bugs that Claude gets caught looping on.
How would you compare Cursor to Claude Code? I’m yet to try the latter.
I'm surprised there isn't a VibeIDE yet that is purpose-built to make it possible for your grandmother to execute code output by an LLM.
The major LLM chat interfaces often have code execution built in, so there kind of is, it just doesn't look like what an SWE thinks of as an IDE.
My impression (with Cursor) is that you need to practice some sort of LLM-first design to get the best out of it. Either vibe code your way from the start, or be brutal about limiting what changes the agent can make without your approval. It does force you to be very atomic about your requests, which isn't a bad thing, but writing a robust spec for the prompt is often slower than writing the code by hand and asking for a refactor. As soon as kipple, for lack of a better word, sneaks into the code, it's a reinforcing signal to the agent that it can add more.
It's definitely worth paying the $20 and playing with a few different clients. The rabbit hole is pretty deep and there's still a ton of prompt engineering suggestions from the community. It encourages a lot of creative guardrails, like using pre-commit to provide negative feedback when the model does something silly like try to write a 200 word commit message. I haven't tried JetBrains' agent yet (Junie), but that seems like it would be a good one to explore as well since it presumably integrates directly with the tooling.
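As a concrete (entirely hypothetical) sketch of that kind of guardrail, a commit-msg hook that pushes back on bloated commit messages could look roughly like this; the word limit is an assumption, and it's written as a standalone Node script rather than an actual pre-commit framework config:

    #!/usr/bin/env node
    // Hypothetical commit-msg hook: reject overlong commit messages so the agent
    // gets immediate negative feedback instead of a silent success.
    import { readFileSync } from "node:fs";

    const MAX_WORDS = 50; // assumed threshold, tune to taste

    const msgFile = process.argv[2]; // git passes the path to the commit message file
    const words = readFileSync(msgFile, "utf8").trim().split(/\s+/).length;

    if (words > MAX_WORDS) {
      console.error(`Commit message is ${words} words; keep it under ${MAX_WORDS}.`);
      process.exit(1); // non-zero exit aborts the commit
    }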
Language: With less common languages, syntax errors rise, and a common failure mode is the syntax of a more popular language bleeding through.
Domain: More than what humans deem complex, quality is controlled by how much code and documentation there is for a domain. Interestingly, in a less common subdomain it will often revert to a more common approach (for example, working on shaders for a game that takes place in a cylindrical geometry requires a lot more hand-holding than on a plane). It's usually not that they can't do it, but that they require much more involved prompting to get the context appropriately set up, and then management of the drift back to default, more common patterns. Related are decisions with long-term consequences; LLMs are pretty weak at these. In humans this comes with experience, so it's rare and an instance of low coverage.
Dates: A related failure is reverting to obsolete API patterns.
Complexity: While not as dominant as domain coverage, complexity does play a role, with the likelihood of error rising as complexity grows.
This means if you're at the intersection of multiple of these (such as a low coverage problem in a functional language), agent mode will likely be too much of a waste for you. But interactive mode can still be highly productive.
It's mostly about the cost though. Things are far more affordable in the various apps/subscriptions. Token-priced APIs can get very expensive very quickly.
I used Cursor well over a year ago. It gave me a headache; it was very immature. Used Cursor more recently: the headache intensity increased. It's not Cursor, it's the senseless loops hoping for the LLM to spit out something somewhat correct. Revisiting the prompt. Trying to become an elite in language protocols because we need that machine to understand us.
Leaving aside the headache and its side effects: it isn't clear we haven't already maxed out on productivity-tool efficiency. Autocomplete. Indexed, searchable docs on a second screen rather than having to turn the pages of some reference book. Etc., etc.
I'm convinced at this stage that we've already started to trade too far. So far beyond the optimal balance that these aren't just diminishing returns; the returns are diminishing in absolute terms.
Engineers need to spend more time thinking.
I'm convinced that engineers, if they were to choose, would throw this thing out and make space for more drawing boards, would take a 5-minute Solitaire break every hour. Or take a walk.
For some reason the constant pressure to go faster eventually makes its mark.
It feels right to see thousands of lines of code written up by this thing. It feels aligned with the inadequate way we've been measured.
Anyway. It can get expensive and this is by design.
I have bipolar disorder. This makes programming incredibly difficult for me at times. Almost all the recent improvements to code generation tooling have been a tremendous boon for me. Coding is now no longer this test of how frustrated I can get over the most trivial of tasks. I just ask for what I want precisely and treat responses like a GitHub PR where mistakes may occur. In general (and for the trivial tasks I'm describing) Claude Code will generate correct, good code (I inform it very precisely of the style I want, and tell it to use linters/type-checkers/formatters after making changes) on the first attempt. No corrections needed.
tl;dr - It's been nothing but a boon for this particular mentally ill person.
You can obviously alleviate this by asking it to be more concise but even then it bleeds through sometimes.
I have to think of the guy posting that he fed his entire project codebase to an AI, it refactored everything, modularizing it but still reducing the file count from 20 to 12. "It was glorious to see. Nothing worked of course, but glorious nonetheless".
In the future I can certainly see it get better and better, especially because code is a hard science that reduces down to control flow logic which reduces down to math. It's a much more narrow problem space than, say, poetry or visuals.
I think my coding model ranking is something like Claude Code > Claude 4 raw > Gemini > big gap > o4-mini > o3
All subscription based, not per token pricing. I'm currently using Claude Max. Can't see myself exhausting its usage at this rate but who knows.
The same with o3 and Sonnet (I haven't tested 4.0 much yet to have an opinion).
I feel that we need better parallel evaluation support, where you could evaluate all the top models and decide which one provided the best solution.
Goodhart's law applies here just like everywhere else. Much more so given how much money these companies are dumping into making these models.
No way, is there any way to see the dialog or recreate this scenario!?
> Given the persistence of the error despite multiple attempts to refine the type definitions, I'm unable to fix this specific TypeScript error without a more profound change to the type structure or potentially a workaround that might compromise type safety or accuracy elsewhere. The current type definitions are already quite complex.
The two prior paragraphs, in case you're curious:
> I suspect the issue might be a fundamental limitation or bug in how TypeScript is resolving these highly recursive and conditional types when they are deeply nested. The type system might be "giving up" or defaulting to a less specific type ({ __raw: T }) prematurely.
> Since the runtime logic seems to be correctly hydrating the nested objects (as the builder.build method recursively calls hydrateHelper), the problem is confined to the type system's ability to represent this.
I found, as you can see in the first of the prior two paragraphs, that Gemini often wanted to claim that the issue was on TypeScript's side for some of these more complex issues. As proven by Opus, this simply wasn't the case.
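To make the flavor of the problem concrete, here's my own reconstruction of that kind of recursive conditional type; only the { __raw: T } shape comes from the quoted output, the rest is assumed:

    // Hypothetical reconstruction, not the actual types from that codebase.
    type Raw<T> = { __raw: T };

    // Recursively "hydrate": unwrap __raw and recurse into arrays and objects.
    type Hydrated<T> =
      T extends Raw<infer U> ? Hydrated<U> :
      T extends (infer E)[] ? Hydrated<E>[] :
      T extends object ? { [K in keyof T]: Hydrated<T[K]> } :
      T;

    // With enough nesting and conditional indirection, resolution can bail out,
    // which would match the "defaulting to { __raw: T } prematurely" symptom
    // Gemini described above.
    type Example = Hydrated<Raw<{ a: Raw<{ b: Raw<number>[] }> }>>;
    // Expected: { a: { b: number[] } }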
idk whats the hype about gemini, it's really not that good imho
I do not understand how those machines work.
I get that with most of the better models I've tried, although I'd probably personally favor OpenAI's models overall. I think a good system prompt is probably the best way there, rather than relying on some "innate" "clean code" behavior of specific models. This is a snippet of what I use today for coding guidelines: https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313...
> That being said it occasionally does something absolutely stupid. Like completely dumb
That's a bit tougher, but you have to carefully read through exactly what you said and try to figure out what might have led it down the wrong path, or what you could have said in the first place for it to avoid that. Try to work it into your system prompt, then slowly build up your system prompt so every one-shot gets closer and closer to being perfect on the first try.
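In case it helps to picture the loop: the system prompt just rides along on every request, so each rule you fold in applies to every later one-shot. A minimal sketch, assuming the OpenAI chat completions endpoint; the guideline text and model name are placeholders:

    // Keep folding lessons from bad one-shots into a single system prompt.
    const SYSTEM_PROMPT = [
      "Prefer small, focused functions.",
      "Do not add comments that merely restate the change.",
      // ...append a new rule each time a one-shot goes off the rails
    ].join("\n");

    async function oneShot(task: string): Promise<string> {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4.1", // placeholder model name
          messages: [
            { role: "system", content: SYSTEM_PROMPT },
            { role: "user", content: task },
          ],
        }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }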
With Sonnet, at least I don't run out of usage before I actually get it to understand my problem scope.
It's going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get, given their relatively paltry revenue at the moment and rising costs for hardware and electricity?
If the next generation of LLMs needs new data sources, then Facebook and Google seem well positioned there. OpenAI, on the other hand, seems likely to lose such a race for proprietary data sets as, unlike those other two, they don't have another business that generates such data.
When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.
What is new money coming into OpenAI getting now?
At even a $300B valuation, a typical Wall Street analyst would want to value them at 2x sales, which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Or, at an extremely lofty P/E ratio of say 100, that would be $3B in annual earnings, which analysts would have to expect to double each year for the next 10ish years, à la AMZN in the 2000s, to justify this valuation.
They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.
Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.
"chatgpt" is a verb. People have no idea what claude or gemini are, and they will not be interested in it, unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change product (the little moat that ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint, add memories and no super obvious path to export/import either and you are done here).
All that OpenAI would have to do to easily be worth their valuation eventually is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part), they can at any point simply do what others have resorted to successfully and copy with a slight delay. People won't care.
I already see lots of normal people share screenshots of the AI Overview responses.
And when is that going to be? Google clearly has the ability to convert google.com into a ChatGPT clone today if they wanted to. They already have a state of the art model. They have a dozen different AI assistants that no one uses. They have a pointless AI summary on top of search results that returns garbage data 99% of the time. It's been 3+ years and it is clear now that the company is simply too scared to rock the boat and disrupt its search revenue. There is zero appetite for risk, and soon it'll be too late to act.
One well-placed ad campaign could easily change all that. Doesn't hurt that Google can bundle Gemini into Android.
I can switch tomorrow to use gemini or grok or any other llm, and I have, with zero switching cost.
That means one stumble on the next foundational model and their market share drops in half in like 2 months.
Now the same is true for the other llms as well.
For example, I had occasion to chat with a relative who's still in high school recently, and was curious what the situation was in their classrooms re: AI.
tl;dr: LLM use is basically universal, but ChatGPT is not the favored tool. The favored tools are LLMs/apps specifically marketed as study/homework aids.
It seems like the market is fine with seeking out specific LLMs for specific kinds of tasks, as opposed to some omni-LLM one-stop shop that does everything. The market has already, and rapidly, moved beyond ChatGPT.
Not to mention I am willing to bet that Gemini has radically more usage than OpenAI's models simply by virtue of being plugged into Google Search. There are distribution effects, I just don't think OpenAI has the strongest position!
I think OpenAI has some first-mover advantage, I just don't think it's anywhere near as durable (nor as large) as you're making it out to be.
Oops I think you may have flipped the numerator and the denominator there, if I'm understanding you. Valuation of 300B, if 2x sales, would imply 150B sales.
Probably your point still stands.
Although it does feel likely that at minimum, they are neck and neck with Google and others.
There’s been other stuff from Sam Altman that puts it around this summer, so even if it gets delayed past July, it seems pretty clear it’s coming within the next few months.
What? Apple has a revenue of 400B and a market cap of 3T
Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.
I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway
The only way for OpenAI to really get ahead on solid ground is to discover some sort of absolute game changer (new architecture, new algorithm) and manage to keep it bottled away.
I think that will be the game changer OpenAI will show us soon.
Don't they have a data center in progress as we speak? Seems by now they're planning on building not just one huge data center in Texas, but more in other countries too.
they haven't been number one for quite some time and still people can't stop presenting them as the leaders
Plus, it’s a consumer product; it doesn’t matter if people are “presenting them as leaders”, it matters if hundreds of millions of totally average people will open their computers and use the product. OpenAI has that.
Lmfao where did you get this from? Microsoft has less than half of that revenue, and is valued > 10x than OpenAI.
Revenue is not the metric by which these companies are valued...
OAI, on the other hand, must spend a lot of additional money for every single new user, both free and paid. Adding a million new OAI users tomorrow would mean a gigantic red hole in the profits (adding to the existing negative). OAI has little or no benefit of scale, unlike other industries.
I have no knowledge about corporate valuations, but I strongly suspect that OAI's valuation needs to account for this issue.
In Canada, a third of the dates we see are British, and another third are American, so it’s really confusing. Thankfully y-m-d is now a legal format and seems to be gaining ground.
they are clearly trolling OpenAI's 4o and o4 models.
It makes you look even more stupid.
Sure, I'm a lazy bum: I call the variable "json" instead of "jsonStringForX", but it's contextual (within a closure or function). I appreciate the feedback, but it makes reviewing the changes difficult (too much noise).
For code like this, it keeps changing processing_class=tokenizer to "tokenizer=tokenizer", even though the parameter was renamed, and even after adding the all-caps comment.
    from trl import SFTTrainer

    # Set up the SFTTrainer
    print("Setting up SFTTrainer...")
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        args=sft_config,
        processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
    )
    print("SFTTrainer ready.")
I haven't tried with this latest version, but the 05-06 Pro still did it wrong. It is worth it sometimes, but usually I use it to explore ideas and then have o1-pro spit out a perfect solution, ready to diff, test, and merge.
"# Added this function" "# Changed this to fix the issue"
No, I know, I was there! This is what commit messages are for, not comments that are only relevant in one PR.
# Removed iterMod variable here because it is no longer needed.
It's like it spent too much time hanging out with an engineer who doesn't trust version control and prefers to just comment everything out. Still enjoying Gemini 2.5 Pro more than Claude Sonnet these days, though, purely on vibes.
I've not tested this thoroughly; it's just my anecdotal experience over like a dozen attempts.
It's something I read a little while ago in a larger article, but I can't remember which article it was.
Something like, "Forbidden character list: [—, –]" or "Do NOT use the characters '—' or '–' in any of your output"
I'm thinking of cancelling my ChatGPT subscription because I keep hitting rate limits.
Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
Also note that AI studio via default free tier API access doesn't seem to fall within "commercial use" in Google's terms of service, which would mean that your prompts can be reviewed by humans and used for training. All info AFAIK.
This is not true for the Gemini 2.5 Pro Preview model, at least. Although this model API is not available on the Free Tier [1], you can still use it on AI Studio.
Seconded.
Either way, Google's transparency with this is very poor - I saw the limits from a VP's tweet
But everyone is using them for different things and it doesn't always generalize. Maybe Claude was great at typescript or ruby or something else I don't do. But for some of us, it definitely was not astroturf for Gemini. My whole team was talking about how much better it was.
I haven't used Claude, but Gemini has always returned better answers to general questions relative to ChatGPT or Copilot. My impression, which could be wrong, is that Gemini is better in situations that are a substitute for search. How do I do this on the command line, tell me about this product, etc. all give better results, sometimes much better, on Gemini.
What are your usecases? Really not my experience, Claude disappoints in Data Science and complex ETL requests in python. O3 on the other hand really is phenomenal.
I can't speak to it now - have mostly been using Claude Code w/ Opus 4 recently.
Still actually falling behind the official scores for o3 high. https://aider.chat/docs/leaderboards/
Not sure if OpenAI has updated o3, but it looks like "pure" o3 (high) has a score of 79.6% in the linked table, while the "o3 (high) + gpt-4.1" combo has the highest score of 82.7%.
The previous Gemini 2.5 Pro Preview 05-06 (yea, not current 06-05!) was at 76.9%.
That looks like a pretty nice bump!
But either way, these Aider benchmarks seem to be most useful/trustworthy benchmarks currently and really the only ones I'm paying attention to.
[1] https://nitter.net/OfficialLoganK/status/1930657743251349854...
This table seems to indicate it's markedly worse?
https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
"Custom Search JSON API: The primary solution offered by Google is the Custom Search JSON API. This API allows you to create a customized search engine that can search a collection of specified websites. While it's not a direct equivalent to a full-fledged Google Search API, it can be configured to search the entire web."
In my experience it's essentially the same as Google Search if configured properly.
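For reference, a minimal sketch of calling it; the endpoint and the key/cx/q parameters are the documented ones, while the env var names and query are placeholders:

    // Query the Custom Search JSON API of an engine configured to search the whole web.
    async function customSearch(query: string) {
      const url = new URL("https://www.googleapis.com/customsearch/v1");
      url.searchParams.set("key", process.env.GOOGLE_API_KEY!); // API key (placeholder env var)
      url.searchParams.set("cx", process.env.GOOGLE_CSE_ID!);   // engine ID (placeholder env var)
      url.searchParams.set("q", query);

      const res = await fetch(url);
      if (!res.ok) throw new Error(`Search failed: ${res.status}`);
      const { items = [] } = await res.json();

      // Each result item carries title, link, and snippet among other fields.
      return items.map((i: any) => ({ title: i.title, link: i.link, snippet: i.snippet }));
    }

    customSearch("gemini 2.5 pro release notes").then(console.log);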
- "Something went wrong error" after too many prompts in a day. This was an undocumented rate limit because it never occurs earlier in the day and will immediately disappear if you subscribe for and use a new paid account, but it won't disappear if you make a new free account, and the error going away is strictly tied to how long you wait. Users complained about this for over a year. Of course they lied about the real reasons for this error, and it was never fixed until a few days ago when they rug pulled paying users by introducing actual documented tight rate limits.
- "You've been signed out" error if the model has exceeded its output token budget (or runtime duration) for a single inference, so you can't do things like what Anthropic recommends where you coax the model to think longer.
- I have less definitive evidence for this but I would not be surprised if they programmatically nerf the reasoning effort parameter for multiturn conversations. I have no other explanation for why the chain of thought fails to generate for small context multiturn chats but will consistently generate for ultra long context singleturn chats.
After that I moved to OpenAI; Gemini models just seem unreliable in that regard.
Isn’t this what you can do with system instructions?
Are you talking about Sonnet 4 which never came to Windsurf because Anthropic does not want to support OpenAI?
However, in my personal experience Sonnet 3.x has still been king so far. Will be interesting to watch this unfold. At this point, it's still looking grim for Windsurf.
With the Claude Max development, non-vibing users seem to be going to Claude Code. This makes me think that maybe Cursor should have taken an exit, cause Claude Code is gonna eat everyone's lunch?
I've been preferring to use Copilot agent mode with Sonnet 4, but it asks you to intervene a lot.
Direct chat and copy pasting code? Seems clunky.
Or manually switching in cursor? Although is extra cost and not required for a lot of tasks where Cursor tab is faster and good enough. So need to opt in on demand.
Cline + open router in VSCode?
Something else?