3 pointsby mpweiher5 hours ago1 comment

turtleyacht5 hours ago
The study is missing evaluation of the negative test, where they look at the model's response after a follow-up like "You were wrong. Try again."
It would be interesting to see whether models doubled down or hallucinated a different response, whether synthesis of doubt and first-pass analysis gives a better result.