That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple, like "are these lines crossed?" or "which letter is circled?" See https://vlmsareblind.github.io/ for some clear examples.