But yes, sadly it looks like the agent cheated during the eval
(Given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume a genuine oversight rather than "benchmaxxing"; it's probably an easy thing to miss if you're new to benchmarking.)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images
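For anyone unsure what "use the most up-to-date code" means in practice, here's a rough sketch, assuming you're inside a checkout of the SWE-bench repo; the module path and flag names reflect my understanding of the public harness CLI and the filenames are placeholders, so check the repo's README before relying on this:

    # Rough sketch (my assumptions, not official docs): update the harness,
    # then re-run SWE-bench Verified so the evaluation uses the fixed Docker images.
    import subprocess

    # Refresh the harness code and reinstall it (run inside a SWE-bench checkout).
    subprocess.run(["git", "pull"], check=True)
    subprocess.run(["pip", "install", "-e", "."], check=True)

    # Invoke the evaluation CLI; flag names are assumptions based on the public harness.
    subprocess.run(
        [
            "python", "-m", "swebench.harness.run_evaluation",
            "--dataset_name", "princeton-nlp/SWE-bench_Verified",
            "--predictions_path", "preds.jsonl",  # your model's patches (placeholder filename)
            "--max_workers", "8",
            "--run_id", "rerun_with_updated_images",  # placeholder run id
        ],
        check=True,
    )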
I don't doubt that it was an oversight, but it does say something about the researchers that they didn't look at a single output, where they would have immediately caught this.
Claude spits that out very regularly at the end of an answer when it's clearly out of its depth and wants to steer the discussion away from that blind spot.
That said, Sonnet 4.5 isn't new, and there have been loads of innovations recently.
Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.
FYI, I use CC (Claude Code) for Anthropic models and OpenCode for everything else.
They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.