27 points by musculus 3 hours ago | 7 comments
  • v_CodeSentinal an hour ago
    This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

    The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.
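
    Concretely, the loop looks something like this (a toy sketch in TypeScript; generate and check are hypothetical stand-ins for the LLM call and the compiler/test run, not any particular framework):

        // Generate -> verify -> feed errors back; only accept output that passed
        // the deterministic check. The caller supplies `generate` and `check`.
        type CheckResult = { ok: boolean; errors: string[] };

        async function verifiedGenerate(
          generate: (task: string, feedback: string) => Promise<string>, // generative step (LLM)
          check: (code: string) => Promise<CheckResult>,                 // deterministic step (compile/tests)
          task: string,
          maxAttempts = 3,
        ): Promise<string> {
          let feedback = "";
          for (let attempt = 0; attempt < maxAttempts; attempt++) {
            const code = await generate(task, feedback);
            const result = await check(code);
            if (result.ok) return code;             // verified, so it can be trusted
            feedback = result.errors.join("\n");    // the environment corrects the model
          }
          throw new Error("no verified solution within the attempt budget");
        }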

    • zoho_seni an hour ago
      I've been using codex and have never had a compile-time error by the time it finishes. Maybe add to your agent instructions to run the TS compiler, lint, and format before it finishes, and only stop when everything passes.
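
      Something like this as the finishing gate works (a sketch; it assumes tsc, eslint, and prettier are already configured in the repo, and the agent keeps iterating until it exits 0):

          // check.ts: run the deterministic gates; the agent isn't done until this exits 0.
          // Assumes tsc, eslint, and prettier are already set up for the project.
          import { execSync } from "node:child_process";

          for (const cmd of ["npx tsc --noEmit", "npx eslint .", "npx prettier --check ."]) {
            execSync(cmd, { stdio: "inherit" }); // throws (non-zero exit) on the first failure
          }
          console.log("all checks passed");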
  • bwfan123 31 minutes ago
    I am actually surprised that the LLM came so close. I doubt it had examples in its training set for these numbers. This goes to the heart of "know-how". The LLM should have said "I am not sure", but instead it gets into rhetoric to justify itself. It actually mimics human motivated reasoning. At orgs, management is impressed by this overconfident motivated reasoner because it mirrors them. To hell with the facts and the truth; persuasion is all that matters.
  • threethirtytwo 21 minutes ago
    You don't need a test to know this: we already know there's heavy reinforcement training done on these models, so they optimize for passing the training. Passing the training means convincing the person rating the answers that the answer is good.

    The keyword is convince. So it just needs to convince people that it's right.

    It is optimizing for convincing people. Of all the answers that can convince people, some happen to be actually correct and others are wrong.

  • benreesman 2 hours ago
    They can all write lean4 now, don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, couple lines of code. Graded monads are a snap in CIC.
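
    A toy version of what "numbers that carry proofs" means, in plain Lean 4 (the specific numbers are just an illustration, nothing from the article):

        -- A claimed square root is only accepted together with a kernel-checked proof.
        structure CheckedSqrt (n : Nat) where
          root  : Nat
          proof : root * root = n

        -- 215 is accepted as the root of 46225 because `decide` actually produces the proof.
        def sqrt46225 : CheckedSqrt 46225 := ⟨215, by decide⟩

        -- A wrong claim such as ⟨216, by decide⟩ is rejected: decide can't produce the proof.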
    • dehsge 19 minutes ago
      There are some numbers that are uncomputable in Lean. You can do things to approximate them in Lean; however, those approximations may still be wrong. Lean's uncomputable namespace is very interesting.
  • semessier an hour ago
    that's not a proof
    • frontfor 23 minutes ago
      Agreed. Asking the AI to do a calculation isn’t the same as asking it to “prove” a mathematical statement in the usual meaning.
    • groundzeros2015 24 minutes ago
      I think it’s a good way to prove x = sqrt(y). What’s your concern?
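
      Verifying is the easy direction: for the principal root, x = sqrt(y) is exactly the claim that x >= 0 and x * x = y, so checking a proposed answer is a single multiplication. E.g. if a model claims sqrt(2,310,400) = 1,520 (numbers made up for illustration), computing 1,520 * 1,520 = 2,310,400 settles it, with no trust in the model's reasoning required.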
  • rakmo 34 minutes ago
    Is this hallucination, or is this actually quite human (albeit a specific type of human)? Think of slimy caricatures like the used car salesman: isn't this the exact type of underhandedness you'd expect?
  • fragmede an hour ago
    > a session with Gemini 2.5 Pro (without Code Execution tools)

    How good are you at programming on a whiteboard? How good is anybody? With code execution tools withheld from me, I'll freely admit that I'm pretty shit at programming. Hell, I barely remember the syntax in some of the more esoteric, unpracticed places of my knowledge. Thus, it's hard not to see case studies like this as dunking on a blindfolded free throw shooter, and calling it analysis.

    • blibble an hour ago
      > How good are you at programming on a whiteboard?

      pretty good?

      I could certainly do a square root

      (given enough time, that one would take me a while)
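
      For the curious, the by-hand procedure is essentially Heron's / Newton's iteration; a rough sketch of the integer version in TypeScript (mine, not from the article):

          // Heron's / Newton's method for an integer square root: repeatedly average
          // the current guess with n / guess until it stops shrinking.
          function isqrt(n: bigint): bigint {
            if (n < 0n) throw new RangeError("negative input");
            if (n < 2n) return n;
            let x = n;                    // any overestimate works as a starting point
            let y = (x + n / x) / 2n;
            while (y < x) {
              x = y;
              y = (x + n / x) / 2n;
            }
            return x;                     // floor of the true square root
          }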

    • htnthrow11220 an hour ago
      It's like that, but the blindfolded free-throw shooter is also the scorekeeper and the referee, and tells you with complete confidence that the ball went in while you looked away for a second.
    • cmiles74 31 minutes ago
      It's pretty common for software developers to be asked to code up some random algorithm on a whiteboard as part of the interview process.