2 points by mariuz 3 hours ago | 1 comment

olooney 3 hours ago
Incredibly detailed! The vision transformer stuff in particular is very useful to know. It's interesting that the token budgets are so much higher (up to 1120 tokens) than GPT's, which works out to 170 tokens per 512x512 tile. I wonder if that will lead to more granular spatial vision, something GPT struggles with.
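
For a rough sense of the GPT side of that comparison, here is a minimal sketch that assumes tokens scale linearly at 170 per 512x512 tile, as quoted above. The function name `tile_tokens` is hypothetical, and the sketch ignores any resizing or fixed per-image overhead a real API may apply.

```python
import math

TOKENS_PER_TILE = 170  # per-tile figure quoted in the comment for GPT
TILE_SIDE = 512        # tile edge in pixels, also from the comment


def tile_tokens(width: int, height: int) -> int:
    """Estimate tokens for an image cut into 512x512 tiles.

    Assumption: partial tiles count as full tiles, and there is no
    resizing step or fixed per-image overhead.
    """
    tiles = math.ceil(width / TILE_SIDE) * math.ceil(height / TILE_SIDE)
    return tiles * TOKENS_PER_TILE


print(tile_tokens(512, 512))    # 170
print(tile_tokens(1024, 1024))  # 680
```

Under these assumptions a single 512x512 image costs 170 tokens, well under the 1120 upper bound mentioned above.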