8 points by mengk 4 hours ago | 1 comment
  • mengk 4 hours ago
    Terminal-Bench evaluates how well models carry out complex tasks in the terminal. On the official leaderboard, GPT-5.1 Codex underperforms GPT-5 Codex by 6.5 percentage points, even with the same scaffold. What explains the regression? In under an hour, Docent finds that the regression probably stems from timeout errors rather than from worse task-solving performance.
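
A rough way to sanity-check a claim like this is to compare each model's raw pass rate with its pass rate over only the runs that did not time out. The sketch below is a minimal illustration under stated assumptions: the record shape (`outcome` set to `"pass"`, `"fail"`, or `"timeout"`), the helper `pass_rates`, and the sample records are all hypothetical, and none of it reflects Docent's actual method or real Terminal-Bench data.

```python
from collections import Counter


def pass_rates(runs):
    """Return (overall pass rate, pass rate excluding timed-out runs).

    `runs` is a list of dicts with a hypothetical "outcome" key whose value
    is "pass", "fail", or "timeout".
    """
    outcomes = Counter(r["outcome"] for r in runs)
    total = sum(outcomes.values())
    finished = total - outcomes["timeout"]  # runs that ended before the limit
    overall = outcomes["pass"] / total if total else 0.0
    excluding_timeouts = outcomes["pass"] / finished if finished else 0.0
    return overall, excluding_timeouts


# Made-up records for illustration only; these are not Terminal-Bench results.
runs_a = [{"outcome": o} for o in ["pass", "pass", "fail", "timeout"]]
runs_b = [{"outcome": o} for o in ["pass", "fail", "timeout", "timeout"]]

for name, runs in [("model A", runs_a), ("model B", runs_b)]:
    overall, no_timeouts = pass_rates(runs)
    print(f"{name}: overall={overall:.2f}, excluding timeouts={no_timeouts:.2f}")
```

If the gap between two models narrows once timed-out runs are set aside, that points toward timeouts as a contributing factor, although it does not show why the runs timed out.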