Instead of discussing “reasoning” in a vague way, the paper studies LLM behavior on 3-SAT, especially near the phase transition (clause-to-variable ratio around 4.27), where instances become much harder. This brings the discussion closer to computational complexity and avoids bare benchmarking.
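For concreteness, here is a minimal sketch of the kind of instance generation this setup implies: random 3-SAT at a chosen clause-to-variable ratio, with ~4.27 as the empirically observed hardness threshold. The function names and the brute-force checker are my own illustration, not from the paper; the checker is only viable for very small n.

```python
import random
from itertools import product

def random_3sat(n_vars, ratio=4.27, seed=0):
    """Random 3-SAT instance: m = round(ratio * n_vars) clauses,
    each over 3 distinct variables, each literal negated with prob 1/2.
    Positive int v means variable v; -v means its negation."""
    rng = random.Random(seed)
    m = round(ratio * n_vars)
    clauses = []
    for _ in range(m):
        vs = rng.sample(range(1, n_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in vs))
    return clauses

def brute_force_sat(n_vars, clauses):
    """Exhaustive satisfiability check -- exponential, small n only."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

Sweeping `ratio` from well below to well above ~4.27 and plotting the satisfiable fraction is the classic way the easy-hard-easy pattern shows up; presumably the paper's hard region is sampled around that threshold.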
The results seem to suggest that many models fail badly in the hard region, while some newer ones may capture a bit more genuine reasoning structure.
I wonder if this is a meaningful bridge between LLM evaluation and complexity theory, or if it is still mostly a stress test and not much more.
Modern SAT solvers are remarkably good at this. I think there are a lot of potential synergies between such symbolic solvers and machine learning (and maybe even LLMs). But it doesn’t seem like an LLM’s ability to directly solve these tasks, with no symbolic tool use, is going to predict the quality of those synergies.