2 points by gaurav71531 10 hours ago | 1 comment
  • gaurav71531 10 hours ago
    We show that a simple harness that fixes the 'intent-execution gap' achieves SOTA pass@1 on 21 models (spanning Claude, GPT, Gemini, Grok, and Qwen) on agentic benchmarks (SWE-Pro, -verif, tb2). This is the first time a single open-source harness reproduces or improves on results on popular benchmarks for modern LLMs! The code is public to try and build on: https://github.com/strands-labs/benchmark-harnesses

    More importantly, we also generated 138k high-quality agent trajectories (SOTA pass@1) and present a detailed study of them, "Dissecting model behavior through agent trajectories": https://arxiv.org/abs/2606.17454

    Models that achieve similar pass@1 behave very differently internally, and we quantify this using several metrics (such as code state-spaces).
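
For readers less familiar with the metric, here is a minimal sketch of how pass@1 could be computed from a set of trajectories. It assumes pass@1 means the per-task success rate of a single attempt, averaged over tasks (and over attempts when a task was run more than once); the paper may define it differently. The function name `pass_at_1` and the `(task_id, resolved)` input format are hypothetical and are not taken from the harness.

```python
from collections import defaultdict


def pass_at_1(results):
    """Estimate pass@1 from (task_id, resolved) pairs, one pair per attempt.

    Assumption: each task's score is its fraction of successful attempts,
    and the benchmark score is the unweighted mean over tasks.
    """
    per_task = defaultdict(list)
    for task_id, resolved in results:
        per_task[task_id].append(bool(resolved))
    if not per_task:
        raise ValueError("no results given")
    # Success rate per task, then macro-average across tasks.
    task_rates = [sum(attempts) / len(attempts) for attempts in per_task.values()]
    return sum(task_rates) / len(task_rates)


# Hypothetical data: task "a" solved in 1 of 2 attempts, task "b" in 1 of 1.
example = [("a", True), ("a", False), ("b", True)]
score = pass_at_1(example)
```

Averaging per task first keeps a task that was attempted many times from outweighing one that was attempted once.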