2 points by lihanc111 2 hours ago | 1 comment
  • lihanc111 2 hours ago
    Hey HN,

    We dug into 17 billion tokens of behavioral data across 413K AI agent trajectories (CoderForge-Preview) attempting real GitHub issues. Instead of just looking at final SWE-bench scores, we compared successful runs against failing runs on the exact same problem to filter out task-difficulty confounds.
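    To make the pairing idea concrete, here is a minimal sketch of comparing a behavioral feature between passing and failing runs on the same task (the trajectory schema with 'task_id', 'success', and per-run feature values is my illustration, not the study's actual data format):

```python
from collections import defaultdict

def paired_feature_diffs(trajectories, feature):
    """For each task, compare the mean of a behavioral feature between
    successful and failing runs on that SAME task, so that task
    difficulty cancels out of the comparison."""
    by_task = defaultdict(lambda: {"pass": [], "fail": []})
    for t in trajectories:
        bucket = "pass" if t["success"] else "fail"
        by_task[t["task_id"]][bucket].append(t[feature])

    diffs = []
    for task, runs in by_task.items():
        # Only tasks observed with both outcomes contribute a pair.
        if runs["pass"] and runs["fail"]:
            mean = lambda xs: sum(xs) / len(xs)
            diffs.append(mean(runs["pass"]) - mean(runs["fail"]))
    return diffs
```

    A consistently negative list of diffs for, say, a grep-count feature would be the "more grepping in failing runs" signal described below.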

    The biggest surprise? Agents are not junior developers, and prompting them to act like humans actively hurts their success rate.

    Here is what the data actually shows:

    Human exploration rituals predict failure: "View-before-edit" and "grep-before-edit" are negatively correlated with success. Humans do this to build mental models. Agents already have the codebase in their context window; if they are heavily grepping, they aren't learning, they're flailing.
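    One plausible way to quantify that ritual, assuming a trajectory is an ordered list of (action, target) events (my own hypothetical format, not the study's):

```python
def exploration_before_first_edit(events):
    """Count view/grep actions occurring before the first edit, i.e. the
    'view-before-edit' / 'grep-before-edit' behavior described above."""
    count = 0
    for action, _target in events:
        if action == "edit":
            return count
        if action in ("view", "grep"):
            count += 1
    return count  # run ended without ever editing
```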

    TDD is the ultimate predictor of success: The single strongest behavioral signal of a passing agent is the fraction of early bash commands dedicated exclusively to running the test suite.
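    That fraction could be computed roughly like this; the test-runner patterns and the early-window size are illustrative guesses on my part, not the study's actual classifier:

```python
def early_test_fraction(bash_commands, early_n=10):
    """Fraction of the first `early_n` bash commands that look like
    test-suite invocations."""
    test_markers = ("pytest", "tox", "make test", "npm test")
    early = bash_commands[:early_n]
    if not early:
        return 0.0
    hits = sum(1 for cmd in early if any(m in cmd for m in test_markers))
    return hits / len(early)
```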

    The Single Responsibility Principle is law: Agents that scatter edits across 3 or more files in the first 30% of their run see their success rate plummet. Successful agents fix one targeted thing at a time.
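    A sketch of that flag, again assuming an ordered (action, path) event format of my own invention:

```python
def scattered_early_edits(events, threshold_files=3, early_frac=0.3):
    """True if the first ~30% of a run's events edit `threshold_files`
    or more distinct files, the scattering pattern described above."""
    cutoff = max(1, int(len(events) * early_frac))
    edited = {path for action, path in events[:cutoff] if action == "edit"}
    return len(edited) >= threshold_files
```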

    Perseverance is a myth: If an agent runs the exact same bash command twice early on, it’s a massive failure signal. They don't adapt; they just get stuck in a loop.
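    Detecting that loop is a simple exact-duplicate check; the early-window cutoff is my own illustrative choice, since the post just says "early on":

```python
def repeated_early_command(bash_commands, early_n=10):
    """True if any exact bash command string appears twice within the
    first `early_n` steps, the stuck-in-a-loop signal described above."""
    seen = set()
    for cmd in bash_commands[:early_n]:
        if cmd in seen:
            return True
        seen.add(cmd)
    return False
```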

    Check out the article for the full analysis!

    • PaulHoule 2 hours ago
      "This post is the first in a series. We are extending this analysis to more realistic workloads beyond artificial SWE benchmarks. Follow the account and stay tuned.---"

      Did something get cut off at the end?

      • lihanc111 2 hours ago
        Actually not, I think the "---" was just mistakenly typed XD
        • PaulHoule 24 minutes ago
          Well these days all eyes are on dashes... You commonly see "--" when humans want to use [em dash] but "---" is unusual.