https://ndaybench.winfunc.com/cases/case_874d1b0586784db38b9...
GPT 5.4 allegedly failed, but if you look at the trace, you'll see that it simply couldn't find the file specified in the input prompt. It gave up after 9 steps of searching and was then judged as "missed."
Claude Opus 4.6 somehow passed with grade "excellent", but if you look at its trace, it never managed to find the file either. It just ran out of tool calls after the allowed 24 steps. But instead of admitting defeat, it hallucinated a vulnerability report (probably from similar code or vulnerabilities in its training corpus), which was somehow judged to be correct.
So if you want this to be remotely useful for comparing models, the judging model definitely needs to look at every step of finding the bug, not just the final model output summary.
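To make the point concrete, here is a minimal sketch of what trace-aware judging could look like: reject a report when the agent's shell steps never actually touched the vulnerable file. The names (`TraceStep`, `report_is_grounded`) and the trace schema are illustrative assumptions, not the benchmark's real format.

```python
from dataclasses import dataclass

@dataclass
class TraceStep:
    command: str   # shell command the agent ran
    output: str    # what the tool returned

def report_is_grounded(trace: list[TraceStep], vulnerable_file: str) -> bool:
    """Accept a report only if the trace shows the agent saw the file."""
    return any(
        vulnerable_file in step.command or vulnerable_file in step.output
        for step in trace
    )

# A report naming the right bug, from a trace that never located the
# file, is flagged as a likely hallucination instead of a pass.
trace = [TraceStep("grep -r parse_header src/", "src/http/parser.c: ...")]
print(report_is_grounded(trace, "src/http/parser.c"))    # True
print(report_is_grounded(trace, "src/tls/handshake.c"))  # False
```

A check like this would have caught the Opus case above: correct-looking final report, but no step in the trace ever reached the file.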
I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.
Are you able to share (or point me toward) any high-level details: (key hardware, hosting stack, high-level economics, key challenges)?
I'd love to offer to buy you a coffee but I won't be in Switzerland any time soon.
Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?
It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, this approach produces results that are very noisy and meaningless.

Anyway, GLM 5.1 gets a score of 93 for its incorrect report.
You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.
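For what "programmatically validated" might mean in practice, here is a hedged sketch: if each case's answer key records the public CVE ID, the patched file, and the vulnerable function, a structured report can be checked mechanically. All field names and values here are hypothetical placeholders, not the benchmark's actual schema.

```python
def validate_report(report: dict, answer_key: dict) -> bool:
    """Mechanical check of a structured report against a known CVE answer key."""
    return (
        report.get("cve_id") == answer_key["cve_id"]
        and report.get("file") == answer_key["file"]
        and report.get("function") == answer_key["function"]
    )

# Placeholder values for illustration only.
key  = {"cve_id": "CVE-0000-0001", "file": "src/parser.c", "function": "parse_len"}
good = {"cve_id": "CVE-0000-0001", "file": "src/parser.c", "function": "parse_len"}
bad  = {"cve_id": "CVE-0000-0001", "file": "src/other.c",  "function": "parse_len"}
print(validate_report(good, key))  # True
print(validate_report(bad, key))   # False
```

Of course this only checks that the report names the right location; judging whether the written explanation is correct is where the noisy LLM-as-a-judge part comes back in.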
Also, how exactly do you programmatically validate CVEs?
Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "shell steps" presumably means it gets to run 24 commands in the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates it's a hint about where the vulnerability lies.
Must have.
> The Finder will never see the patch.
I wasn't worried that this eval would show the answer to the model before evaluating it. It seems the internal requirements leaked into this post.
Will incorporate false-positive rates into the rubric from the next run onwards.
At winfunc, we spent a lot of research time taming these models to drive down false positives (the rate is high!), so this does feel important enough to document. Thanks!