We aren't sure whether these gains happen because code execution is a stronger form of verification compared to pure CoT or because it encourages qualitatively different thinking patterns.
Another interesting finding: interleaved thinking, the model capability behind these gains, seems fragile at the infra/client layer. Soft failures can make capable models look much worse than they actually are.