Instead of discussing “reasoning” in a vague way, the paper studies LLM behavior on 3-SAT, especially near the phase transition (clause-to-variable ratio around 4.27), where instances become much harder. This brings the discussion closer to computational complexity and avoids bare benchmarking.
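For concreteness, here is a minimal sketch of the kind of instance generation this setup implies: random 3-SAT at a chosen clause-to-variable ratio, with ~4.27 as the empirically observed hardness threshold. The function names and the brute-force checker are my own illustration, not from the paper; the checker is only viable for very small n.

```python
import random
from itertools import product

def random_3sat(n_vars, ratio=4.27, seed=0):
    """Random 3-SAT instance: m = round(ratio * n_vars) clauses,
    each over 3 distinct variables, each literal negated with prob 1/2.
    Positive int v means variable v; -v means its negation."""
    rng = random.Random(seed)
    m = round(ratio * n_vars)
    clauses = []
    for _ in range(m):
        vs = rng.sample(range(1, n_vars + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in vs))
    return clauses

def brute_force_sat(n_vars, clauses):
    """Exhaustive satisfiability check -- exponential, small n only."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

Sweeping `ratio` from well below to well above ~4.27 and plotting the satisfiable fraction is the classic way the easy-hard-easy pattern shows up; presumably the paper's hard region is sampled around that threshold.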
The results seem to suggest that many models fail badly in the hard region, while some newer ones may capture a bit more genuine reasoning structure.
I wonder if this is a meaningful bridge between LLM evaluation and complexity theory, or if it is still mostly a stress test and not much more.
Modern SAT solvers are remarkably good at this. I think there are a lot of potential synergies between such symbolic solvers and machine learning (and maybe even LLMs). But it doesn’t seem like an LLM’s ability to directly solve these tasks, with no symbolic tool use, is going to predict the quality of those synergies.