27 points by ammar_x 5 hours ago | 6 comments
  • vanuatu an hour ago
    This benchmark matches my experience with GPT (I occasionally go back to Claude when I run into limits, and I frequently run into forgotten requirements and reward hacking).

    I do have two questions / critiques:

    - The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major qualms with SOTA coding models, i.e. they lack code 'taste'. Ofc this is a difficult problem to solve at scale, but I wanted to point it out nonetheless.

    - This almost feels like it was written as a critique of SWE Bench Pro. Hopefully they fix the issues with that benchmark!

    • vanuatu 37 minutes ago
      Out of curiosity, I examined the worst task:

      https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...

      It seems like GPT is failing here due to an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.

      I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived vs. what a typical user would ask their coding agent (such is the difficulty of benchmark construction)

  • JacobAsmuth 26 minutes ago
    I wonder why they didn't test Gemini 3.5 Flash (High).
  • dnnssl2 3 hours ago
    70% at launch seems pretty saturated, why ship a benchmark frontier models are about to top out on?
  • charleyslee 3 hours ago
    tysm for posting this! i'm charley, cofounder of datacurve, we created this benchmark and my team and i are here to answer any q's.
  • toastmaster11 3 hours ago
    What happened that placed Opus 4.6 on max reasoning below Sonnet 4.6 on a lowered reasoning level?