3 points by mpweiher 5 hours ago | 1 comment
  • turtleyacht 5 hours ago
    The study is missing a negative test: an evaluation of the model's response after a follow-up like "You were wrong. Try again."

    It would be interesting to see whether models doubled down or hallucinated a different response, and whether combining that doubt with the first-pass analysis gives a better result.