While these foundation models aren't trying to be calculators, this kind of test previously provided a decent benchmark of their ability to compose iterative reasoning steps at scale, and showed they were not that good at it.
At this point I'm tempted to conclude they are pretty good at it, since I don't see how such long calculations could really be considered "in distribution" from training or "memorized," except in the sense the model learned the algorithm correctly.
I still have doubts about how good the present architecture & training are at learning to "generalize" effectively, e.g. see ARC3.
But you can go a very long way by memorizing everything, being able to compose steps well, being able to try many times, and being able to verify as well as a human, even if you aren't so efficient in your "fluid intelligence."
The fraction of human cognition operating today that can be handled with the current approach seems pretty large.