Show HN: r1_vlm – Open-Source Framework for Visual Reasoning with GRPO (github.com)

5 points by skumar17 a year ago | 1 comment

oofbey a year ago
This is really pretty cool. LLMs are so bad at images, it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, because this task is not difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.
- skumar17 a year ago
  Thanks! I really love the visualization too. We have a hosted demo you can try as well!
  https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder
  - oofbey a year ago
    Fun! I wish the demo had the attention visualization. Would that be easy to add? Is the source code for the HF demo in the repo too?
    skumar17 a year ago
    Unfortunately it might be a bit challenging as there’s a nontrivial amount of extra computation we do for the viz, but it’s probably possible?
    skumar17 a year ago
    The attention demo code is in the /attention_demo directory if you want to try it on your own messages too :)
- xoofoog a year ago
  What do you mean LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.
  - oofbey a year ago
    That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.
  - skumar17 a year ago
    That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.