I do have two questions / critiques:
- The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major gripes with SOTA coding models, i.e. they lack code 'taste'. Of course this is a difficult problem to solve at scale, but I wanted to point it out nonetheless.
- This almost feels like it was written as a critique of SWE Bench Pro. Hopefully they fix the issues with that benchmark!
https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...
It seems like GPT is failing here due to an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.
I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived compared to what a typical user would ask their coding agent (such is the difficulty of benchmark construction).