Key findings:

- CLIP dominates on average (language-aligned embeddings make the projector's job trivial).
- But I-JEPA, which has never seen text during pre-training, ties CLIP on compositional reasoning (CLEVR).
- Scaling the LLM from 0.5B to 1.5B helped more than swapping any encoder.
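For anyone curious how the stitching works, here is a minimal sketch of the encoder → projector → LLM wiring, assuming a frozen vision encoder and a small MLP projector that maps patch features into the LLM's embedding space. The class and dimension choices below are illustrative assumptions, not the repo's actual API.

```python
# Illustrative sketch only; module names and dimensions are assumptions,
# not the repo's actual code.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from CLIP / ViT / I-JEPA
        return self.proj(patch_feats)  # -> (batch, num_patches, llm_dim)

# Usage: project the patch features, then prepend them to the text token
# embeddings before feeding the combined sequence to the (0.5B or 1.5B) LLM.
projector = VisionProjector(vision_dim=1024, llm_dim=1536)
patches = torch.randn(2, 256, 1024)   # placeholder encoder output
vision_tokens = projector(patches)    # ready to concatenate with text embeds
```

The projector is deliberately small: with CLIP the features are already language-aligned, so even this shallow mapping suffices, which is what makes the I-JEPA result interesting.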
Code, trained weights, and eval scripts are all open: https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM/tree/main
Blog: https://teendifferent.substack.com/p/stitching-vision-into-l...
Curious what others think about I-JEPA-style representations for VLMs — the spatial reasoning results surprised me.