We show that scaling verification compute for a well-designed harness (e.g. ForgeCode + GPT 5.4) can lead to a significant boost in accuracy (81.8% → 86.4%), outperforming Claude Mythos (82%) on Terminal-Bench.
The key finding is that most agents already "know" how to solve the tasks: run an agent repeatedly (say, 100 times) and it will often produce a correct solution at least once. But it cannot tell which of its attempts is the correct one, particularly on long-horizon tasks.
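This gap between "solves it at least once in k tries" and "solves it on the first try" is what the standard unbiased pass@k estimator measures. A minimal sketch (the run counts below are illustrative, not from our experiments):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts sampled (without replacement) from n total runs is correct,
    given that c of the n runs were correct."""
    if n - c < k:
        return 1.0  # fewer incorrect runs than samples: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 100 runs of which 30 succeeded
print(pass_at_k(100, 30, 1))   # pass@1: success rate of a single attempt
print(pass_at_k(100, 30, 10))  # pass@10: much higher, if we could pick the winner
```

The large gap between pass@1 and pass@k is exactly the headroom that a verifier can convert into real accuracy.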
That’s where LLM-as-a-Verifier comes in. It leverages the probability distribution over scoring tokens to provide fine-grained feedback and scales verification through repeated evaluation and criteria decomposition.
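Concretely, rather than taking the verifier's single sampled score, one can read the probabilities the model assigns to each candidate score token and compute an expected score, then average across decomposed criteria and repeated evaluations. A minimal sketch of that aggregation (the 1–5 score scale and the probability inputs here are illustrative assumptions, not the exact harness internals):

```python
def expected_score(token_probs: dict[str, float]) -> float:
    """Fine-grained score from the probability distribution over discrete
    score tokens (e.g. "1".."5"), instead of just the argmax token."""
    total = sum(token_probs.values())  # renormalize over the score tokens
    return sum(int(tok) * p for tok, p in token_probs.items()) / total

def verify(criterion_evals: list[list[dict[str, float]]]) -> float:
    """Scale verification two ways: average expected scores over repeated
    evaluations (inner lists) and over decomposed criteria (outer list)."""
    per_criterion = [
        sum(expected_score(e) for e in evals) / len(evals)
        for evals in criterion_evals
    ]
    return sum(per_criterion) / len(per_criterion)

# Hypothetical run: two criteria, each evaluated twice,
# with verifier probabilities over score tokens "1".."5"
evals = [
    [{"4": 0.6, "5": 0.3, "3": 0.1}, {"4": 0.5, "5": 0.4, "3": 0.1}],
    [{"2": 0.2, "3": 0.5, "4": 0.3}, {"3": 0.6, "4": 0.4}],
]
print(verify(evals))  # one continuous score in [1, 5] per candidate solution
```

The resulting continuous score lets the harness rank many candidate solutions and keep the best one, which is where the scaling benefit comes from.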