Key findings:

- CLIP dominates on average (language-aligned embeddings make the projector's job trivial).
- But I-JEPA, which has never seen text during pre-training, ties CLIP on compositional reasoning (CLEVR).
- Scaling the LLM from 0.5B to 1.5B helped more than swapping any encoder.
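For anyone curious how the stitching works, here is a minimal sketch of the encoder → projector → LLM wiring, assuming a frozen vision encoder and a small MLP projector that maps patch features into the LLM's embedding space. The class and dimension choices below are illustrative assumptions, not the repo's actual API.

```python
# Illustrative sketch only; module names and dimensions are assumptions,
# not the repo's actual code.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from CLIP / ViT / I-JEPA
        return self.proj(patch_feats)  # -> (batch, num_patches, llm_dim)

# Usage: project the patch features, then prepend them to the text token
# embeddings before feeding the combined sequence to the (0.5B or 1.5B) LLM.
projector = VisionProjector(vision_dim=1024, llm_dim=1536)
patches = torch.randn(2, 256, 1024)   # placeholder encoder output
vision_tokens = projector(patches)    # ready to concatenate with text embeds
```

The projector is deliberately small: with CLIP the features are already language-aligned, so even this shallow mapping suffices, which is what makes the I-JEPA result interesting.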
Code, trained weights, and eval scripts are all open: https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM/tree/main
Blog: https://teendifferent.substack.com/p/stitching-vision-into-l...
Curious what others think about I-JEPA-style representations for VLMs — the spatial reasoning results surprised me.