We show that a simple harness that fixes the 'intent-execution gap' achieves SOTA pass@1 on 21 models (spanning model families from diverse providers: Claude, GPT, Gemini, Grok, Qwen) on agentic benchmarks (SWE-Pro, -verif, tb2). This is the first time a single open-source harness reproduces or improves results on popular benchmarks for modern LLMs!
The code is public to try and build on:
https://github.com/strands-labs/benchmark-harnesses
More importantly, we also generated 138k high-quality agent trajectories (SOTA pass@1) and present a detailed study of them:
"Dissecting model behavior through agent trajectories" https://arxiv.org/abs/2606.17454
Models that achieve similar pass@1 behave very differently internally, and we quantify this using several metrics (such as code state-spaces).