Anthropic's Lawsen fired back, arguing that the experimental setup was flawed and the conclusions overstated.
This paper provides evidence supporting Apple's take: failures in solving the Towers of Hanoi "were not purely a result of output constraints, but also partly a result of cognitive limitations" - LRMs still stumble when complexity rises only moderately (around 8 disks).
" we also identified persistent failure modes that reveal limitations in long-horizon consistency and symbolic generalization. Our analysis suggests that these reasoning breakdowns stem not only from architectural constraints, but also from the inherently stochastic nature of these systems and the optimization methods they rely on."