DeepSWE: A contamination-free benchmark for long-horizon coding agents (deepswe.datacurve.ai)

27 points by ammar_x 5 hours ago | 6 comments

vanuatu an hour ago
This benchmark matches my experience with GPT. (I occasionally go back to Claude when I run into limits, and with it I frequently run into forgotten requirements and reward hacking.)
I do have two questions / critiques:
- The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major qualms with SOTA coding models, i.e. they lack code 'taste'. Ofc this is a difficult problem to solve at scale, but I wanted to point it out nonetheless.
- This almost reads like a critique of SWE Bench Pro. Hopefully they fix the issues with that benchmark!
- vanuatu 37 minutes ago
  Out of curiosity, I examined the worst task:
  https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...
  It seems like GPT is failing here because of an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus it had run into the same problem.
  I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived vs. what a typical user would ask their coding agent (such is the difficulty of benchmark construction)
JacobAsmuth 26 minutes ago
I wonder why they didn't test Gemini 3.5 Flash (High).
dnnssl23 hours ago
70% at launch seems pretty saturated, why ship a benchmark frontier models are about to top out on?
- vanuatu 35 minutes ago
  sell data for them to hillclimb :)
- charleyslee 3 hours ago
  [flagged]
charleyslee 3 hours ago
tysm for posting this! i'm charley, cofounder of datacurve, we created this benchmark and my team and i are here to answer any q's.
toastmaster113 hours ago
What happened that placed Opus 4.6 on max reasoning below Sonnet 4.6 on a lowered reasoning level?
ammar_x 5 hours ago
https://x.com/serenaa_ge/status/2059308400866111692