1 point by steinsgate 7 hours ago | 1 comment

steinsgate 7 hours ago
We found something surprising about ARC AGI 2, the benchmark that aims to measure human-like fluid intelligence: just enabling a stateful Python tool boosts performance across models. We saw a more than 4x performance improvement with GPT OSS 120B (high). The effect continues well into frontier territory (GPT 5.2), with double-digit gains.
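For anyone unfamiliar with the term, "stateful" here means that variables defined in one tool call are still available in the next. Below is a minimal sketch of that idea, not our actual harness: the class name `StatefulPythonTool` and its `run` method are hypothetical, and a real setup would run the code in a sandbox rather than call `exec` in-process.

```python
import contextlib
import io
import traceback


class StatefulPythonTool:
    """Hypothetical tool: every call runs in the same shared namespace."""

    def __init__(self):
        # The namespace outlives individual calls; this is what makes it stateful.
        self.namespace = {"__name__": "__tool__"}

    def run(self, code: str) -> str:
        """Execute code and return captured stdout, or the traceback on error."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)  # assumption: trusted input, no sandbox
        except Exception:
            # Return the error as text so the model can read it and retry.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


tool = StatefulPythonTool()
# First call: define a small grid, as a model might while exploring a puzzle.
tool.run("grid = [[0, 1], [1, 0]]")
# Second call: `grid` is still defined, so the model can test a hypothesis on it.
out = tool.run("flipped = [row[::-1] for row in grid]\nprint(flipped)")
print(out)
```

With a stateless tool, the second call would fail with a `NameError`, and the model would have to resend all of its setup code on every call.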
We aren't sure whether these gains come from code execution being a stronger form of verification than pure chain-of-thought (CoT), or from it encouraging qualitatively different thinking patterns.
Another interesting finding: interleaved thinking, the model capability behind these gains, seems fragile at the infra/client layer. Soft failures can make capable models look much worse than they actually are.