We show that scaling verification compute for a well-designed harness (e.g. ForgeCode + GPT 5.4) can lead to a significant boost in accuracy (81.8% → 86.4%), outperforming Claude Mythos (82%) on Terminal-Bench.
The key finding is that most agents already "know" how to solve the tasks: run an agent repeatedly (say, 100 times) and it will often produce a correct solution at least once. But it cannot tell which of its attempts is the correct one, particularly on long-horizon tasks.
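This gap between "solves it at least once in k tries" and "solves it on the first try" is what the standard unbiased pass@k estimator measures. A minimal sketch (the run counts below are illustrative, not from our experiments):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts sampled (without replacement) from n total runs is correct,
    given that c of the n runs were correct."""
    if n - c < k:
        return 1.0  # fewer incorrect runs than samples: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 100 runs of which 30 succeeded
print(pass_at_k(100, 30, 1))   # pass@1: success rate of a single attempt
print(pass_at_k(100, 30, 10))  # pass@10: much higher, if we could pick the winner
```

The large gap between pass@1 and pass@k is exactly the headroom that a verifier can convert into real accuracy.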
That’s where LLM-as-a-Verifier comes in. It leverages the probability distribution over scoring tokens to provide fine-grained feedback and scales verification through repeated evaluation and criteria decomposition.
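Concretely, rather than taking the verifier's single sampled score, one can read the probabilities the model assigns to each candidate score token and compute an expected score, then average across decomposed criteria and repeated evaluations. A minimal sketch of that aggregation (the 1–5 score scale and the probability inputs here are illustrative assumptions, not the exact harness internals):

```python
def expected_score(token_probs: dict[str, float]) -> float:
    """Fine-grained score from the probability distribution over discrete
    score tokens (e.g. "1".."5"), instead of just the argmax token."""
    total = sum(token_probs.values())  # renormalize over the score tokens
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

def verify(criterion_evals: list[list[dict[str, float]]]) -> float:
    """Scale verification two ways: average expected scores over repeated
    evaluations (inner lists) and over decomposed criteria (outer list)."""
    per_criterion = [
        sum(expected_score(e) for e in evals) / len(evals)
        for evals in criterion_evals
    ]
    return sum(per_criterion) / len(per_criterion)

# Hypothetical run: two criteria, each evaluated twice,
# with verifier probabilities over score tokens "1".."5"
evals = [
    [{"4": 0.6, "5": 0.3, "3": 0.1}, {"4": 0.5, "5": 0.4, "3": 0.1}],
    [{"2": 0.2, "3": 0.5, "4": 0.3}, {"3": 0.6, "4": 0.4}],
]
print(verify(evals))  # one continuous score in [1, 5] per candidate solution
```

The resulting continuous score lets the harness rank many candidate solutions and keep the best one, which is where the scaling benefit comes from.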