18 points by T-A 5 hours ago | 5 comments
  • chrisjj 4 hours ago
    The only reasoning failures here are in the minds of humans gulled into expecting chatbot reasoning ability.
    • altmanaltman 18 minutes ago
      But how else will Dario raise Series X?
  • Lapel2742 an hour ago
    > These models fail significantly in understanding real-world social norms (Rezaei et al., 2025), aligning with human moral judgments (Garcia et al., 2024; Takemoto, 2024), and adapting to cultural differences (Jiang et al., 2025b). Without consistent and reliable moral reasoning, LLMs are not fully ready for real-world decision-making involving ethical considerations.

    LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.

  • donperignon an hour ago
    An LLM will never reason; reasoning is an emergent behavior of those systems that is poorly understood. Neurosymbolic systems combined with LLMs will define the future of AI.
    • hackinthebochs an hour ago
      What are neurosymbolic systems supposed to bring to the table that LLMs can't provide in principle? A symbol is just a vehicle with fixed semantics in some context. The embedding vectors of LLMs are just that.
  • simianwords 44 minutes ago
    I'm very skeptical of this paper.

    >Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).

    This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
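
    One way to falsify it: generate random multiplications with increasing operand length and score a model separately on the full product, the last digit, and the first digit. A minimal sketch in Python, assuming the openai client package and a placeholder model name (both stand-ins for whatever you actually want to test, not anything from the paper):

      import random, re
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def ask(prompt: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder model; swap in the one under test
              messages=[{"role": "user", "content": prompt}],
          )
          return resp.choices[0].message.content or ""

      def trial(digits: int) -> dict:
          lo, hi = 10 ** (digits - 1), 10 ** digits - 1
          a, b = random.randint(lo, hi), random.randint(lo, hi)
          truth = str(a * b)
          reply = ask(f"Compute {a} * {b}. Answer with only the number.")
          guess = re.sub(r"[^0-9]", "", reply)
          return {
              "exact": guess == truth,
              "last_digit": bool(guess) and guess[-1] == truth[-1],
              "first_digit": bool(guess) and guess[0] == truth[0],
          }

      for digits in (2, 4, 8, 16):
          runs = [trial(digits) for _ in range(20)]
          print(digits, {k: sum(r[k] for r in runs) / len(runs)
                         for k in ("exact", "last_digit", "first_digit")})

    If the paper's claim holds, exact accuracy should collapse as operands grow and last-digit accuracy should trail first-digit accuracy; if it's wrong for current models, all three stay near 1.0.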

    • simianwords 39 minutes ago
      >Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess richer logical structures that facilitate targeted perturbations. MWPs exemplify this, as their logic can be readily abstracted into reusable templates. Researchers use this property to generate variants by sampling numeric values (Gulati et al., 2024; Qian et al., 2024; Li et al., 2024b) or substituting irrelevant entities (Shi et al., 2023; Mirzadeh et al., 2024). Structural transformations – such as exchanging known and unknown components (Deb et al., 2024; Guo et al., 2024a) or applying small alterations that change the logic needed to solve problems (Huang et al., 2025b) – further highlight deeper robustness limitations.

      I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
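
      To check the numeric-perturbation claim specifically (the GSM-Symbolic-style setup from Mirzadeh et al.), you can abstract one word problem into a template, resample the numbers, and see whether accuracy moves. A rough sketch with a made-up one-step template and the same placeholder OpenAI call as in the arithmetic sketch upthread:

        import random, re
        from openai import OpenAI

        client = OpenAI()  # placeholder client/model, not anything from the paper

        def ask(prompt: str) -> str:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content or ""

        # toy template; the benchmarks cited in the quote abstract far richer structures
        TEMPLATE = ("{name} has {a} apples and buys {b} bags with {c} apples each. "
                    "How many apples does {name} have now? Answer with only the number.")

        def variant():
            a, b, c = (random.randint(2, 99) for _ in range(3))
            return TEMPLATE.format(name="Ava", a=a, b=b, c=c), a + b * c

        n, hits = 50, 0
        for _ in range(n):
            prompt, truth = variant()
            nums = re.findall(r"-?\d+", ask(prompt))
            hits += bool(nums) and int(nums[-1]) == truth
        print(f"accuracy over {n} resampled variants: {hits / n:.2f}")

      If the robustness failures those benchmarks describe were still there, accuracy would sit well below 1.0 across resampled variants; my bet is that a current frontier model stays essentially flat on something this simple.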

      • otabdeveloper4 8 minutes ago
        > We have models that are doing better than humans at IMO.

        Not really. In my brief experience they can guess the final answer, but the intermediate justifications and proofs are complete hallucinated bullshit.

        (Possibly because the final answer is usually something neat and beautiful, and human evaluators don't care about the final answer anyway; in any olympiad you're graded on the soundness of your reasoning.)

  • sergiomattei 3 hours ago
    Papers like these are a much-needed bucket of ice water. We anthropomorphize these systems too much.

    Skimming the conclusions and results: the authors find that LLMs exhibit failures across many axes we'd consider demonstrative of AGI, from moral reasoning to simple things a toddler can do, like counting. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.

    So: if you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.

    Note that the authors also maintain a GitHub repo compiling these failures: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...

    • otabdeveloper4 5 minutes ago
      > We anthropomorphize these systems too much.

      They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.

    • vagrantstreet 2 hours ago
      Isn't it strange that we expect them to act like humans even though, once a model is trained, it remains static? How is this supposed to be even close to "human-like" anyway?
      • mettamage an hour ago
        > Isn't it strange that we expect them to act like humans even though, once a model is trained, it remains static?

        Interacting with an LLM is more akin to interacting with a quirky human who has anterograde amnesia: it can't form long-term memories anymore; it can only follow you within a long-ish conversation.

      • LiamPowell an hour ago
        If we could reset a human to a prior state after a conversation, would conversations with them not still be "human-like"?

        I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.

    • lostmsu an hour ago
      https://en.wikipedia.org/wiki/List_of_cognitive_biases

      Specifically, the idea that LLMs fail at some tasks because of fundamental limitations, when humans also periodically fail at those same tasks, may well be an instance of the fundamental attribution error.