"Multimodal AI systems are increasingly deployed on the assumption that their benchmark performance reflects genuine visual understanding. Our results fundamentally challenge these assumptions. Across every model-benchmark pair tested, the accuracy that frontier models achieved without any access to images exceeded the additional accuracy they gained when images were provided. Moreover, a text-only 3-billion-parameter model, trained solely on question-answer pairs stripped of images, outperformed all frontier multimodal systems and human radiologists on a held-out chest radiology benchmark. Taken together, these results demonstrate that high benchmark accuracy does not reliably indicate visual understanding."
Basically, the models are so good at extracting clues from the question text, and extrapolating from them, that they proceed to answer _as if_ they had an image to view. With confidence, of course.
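To make the quoted comparison concrete, here is a minimal sketch of that kind of ablation: measure a model's accuracy with the image withheld ("blind") and compare it to the extra accuracy gained when the image is provided. The function and the per-question outcomes below are hypothetical, not the paper's actual data or code.

```python
# Hypothetical sketch: compare "blind" (no-image) accuracy against the
# additional accuracy gained when images are provided.

def blind_vs_image_gain(results):
    """results: list of (correct_without_image, correct_with_image) booleans."""
    n = len(results)
    blind_acc = sum(blind for blind, _ in results) / n
    full_acc = sum(full for _, full in results) / n
    return blind_acc, full_acc - blind_acc

if __name__ == "__main__":
    # Made-up per-question outcomes for one model on one benchmark.
    results = [
        (True, True), (True, True), (False, True), (True, True),
        (False, False), (True, True), (False, True), (True, True),
    ]
    blind_acc, image_gain = blind_vs_image_gain(results)
    print(f"blind accuracy: {blind_acc:.2f}, gain from images: {image_gain:.2f}")
    # The paper's claim corresponds to blind_acc > image_gain holding
    # for every model-benchmark pair tested.
    print("blind accuracy exceeds image gain:", blind_acc > image_gain)
```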
Details: https://x.com/euanashley/status/2037993596956328108