If you look at Table 3, you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 × 100 runs:
- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)
- Opus: 1306/2000 questions answered, of which 294 were correct (22%)
So while you can claim that Opus solved 40% of the problems, it still had a failure rate of 78% on the questions it actually answered. That means if you chose this model to answer your homework question, there is a good chance you would fail.
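For anyone who wants to check the arithmetic, here is a quick sketch using only the counts quoted above. The function name `rates` and the dictionary keys are my own labels, not anything from the paper. It also shows accuracy over all 2000 attempts, which is lower still because unanswered questions count against the model.

```python
def rates(answered, correct, total=2000):
    """Summarise one model's counts from the 20 x 100 runs (2000 attempts)."""
    return {
        # share of attempts where the model gave any answer
        "answer_rate": answered / total,
        # share of given answers that were correct (the 75% / 22% figures)
        "accuracy_when_answered": correct / answered,
        # share of all attempts that ended in a correct answer
        "accuracy_over_all_attempts": correct / total,
    }

gpt = rates(answered=1389, correct=1043)
opus = rates(answered=1306, correct=294)

for name, r in (("GPT 5.5", gpt), ("Opus", opus)):
    print(name, {k: f"{v:.1%}" for k, v in r.items()})
```

Note that 294/1306 is about 22.5%, so the quoted 22% looks truncated rather than rounded.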
Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot, i.e. how confident you can be when using them for real-world tasks.
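To make the distinction concrete, here is a rough sketch of the two ways of counting. The data in `toy` is made up purely for illustration (5 questions, 4 attempts each) and has nothing to do with the paper's results. The function names are mine as well.

```python
def one_shot_rate(results):
    """Fraction of individual attempts that were correct.

    results[q] is a list of booleans, one per independent attempt at question q.
    """
    attempts = [ok for runs in results for ok in runs]
    return sum(attempts) / len(attempts)


def solved_at_least_once_rate(results):
    """Fraction of questions with at least one correct attempt."""
    return sum(any(runs) for runs in results) / len(results)


# Hypothetical toy data: 5 questions x 4 attempts.
toy = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
    [False, False, True, False],
    [False, False, False, False],
]

print(one_shot_rate(toy))              # 6 correct out of 20 attempts
print(solved_at_least_once_rate(toy))  # 3 of 5 questions solved at least once
```

The second number is the flattering one. The first is closer to what you experience when you ask the model once and have to trust the answer.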
Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.
Nonetheless, these are questions about existing research, though much closer to a question given to a second-year PhD student than to an exam question.
This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exam questions you can't cheat on using an LLM". "Blindly" in the sense that they can't keep adjusting the problem ahead of time until they get a model to fail.
The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."
This should not be interpreted as "AI can solve mathematics": the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.
But it is still impressive and not what we expected -- I rather expected that we would end up with 20-40 questions no current publicly available model can solve.
Now, reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.
https://math.sciencebench.ai/benchmarks
I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.
It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?
GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.
... that are therefore liable to be in the training data?
> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.
So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
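As I read the quoted Stage W2 rule, the filter boils down to something like the sketch below. The function name, the model labels, and the dictionary shape are placeholders of mine. The only parts taken from the paper's text are "five models" and "at most three correct".

```python
def w2_gate(judged_correct, max_correct=3):
    """Return True if the contributor may proceed with the question.

    judged_correct maps a model label to the LLM judge's verdict
    (True = the model's answer matched the original answer).
    """
    n_correct = sum(judged_correct.values())
    return n_correct <= max_correct


# Hypothetical verdicts for five placeholder models.
hard_enough = {"model_a": True, "model_b": True, "model_c": True,
               "model_d": False, "model_e": False}
too_easy = {"model_a": True, "model_b": True, "model_c": True,
            "model_d": True, "model_e": False}

print(w2_gate(hard_enough))  # three correct: question proceeds
print(w2_gate(too_easy))     # four correct: question is filtered out
```

A question that every model has effectively memorised would come back five out of five and never get past this stage.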
A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.
Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training in using that technique, seeing a solution isn’t too surprising.
The goal was not to define unsolved problems.
But as such, the problems are also not previously published problems.
This seems quite reasonable IMHO.