Great question! It's probabilistic, so it's not really "right vs. wrong" on any single question; it's about who better estimated the likelihood.
One big difference shows up when there's no useful context: we ran the same eval without including any up-to-date context alongside the questions. In that setting, GPT-5 stays overconfident and its BSS (Brier Skill Score) drops to -11.3% (vs. -4.3% for ours), i.e. worse than just guessing the base rate.
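For anyone unfamiliar with the metric: BSS compares a forecaster's Brier score against a reference that always predicts the base rate, so a negative BSS means you'd have done better by ignoring the questions entirely. A minimal sketch (the data here is made up for illustration, not from our eval):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes):
    """BSS = 1 - BS / BS_ref, where the reference always predicts the base rate."""
    base_rate = sum(outcomes) / len(outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    return 1 - brier_score(probs, outcomes) / bs_ref

outcomes = [1, 0, 0, 1, 0, 0, 0, 1]

# Overconfident forecaster: pushes probabilities to the extremes
# even when there's no real signal to justify it.
overconfident = [0.9, 0.8, 0.1, 0.2, 0.9, 0.1, 0.8, 0.9]

# Calibrated forecaster: stays near the base rate when uncertain.
calibrated = [0.4, 0.35, 0.3, 0.45, 0.4, 0.3, 0.35, 0.45]

print(brier_skill_score(overconfident, outcomes))  # negative: worse than the base rate
print(brier_skill_score(calibrated, outcomes))     # positive: better than the base rate
```

The overconfident forecaster gets punished hard by the squared error on its confident misses, which is exactly the failure mode we see without context.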
So one advantage of the RL training is simply learning to know what you don't know, and to identify when there's real signal.