The study is missing evaluation of the negative test, where they look at the model's response after a follow-up like "You were wrong. Try again."
It would be interesting to see whether models doubled down or hallucinated a different response, whether synthesis of doubt and first-pass analysis gives a better result.