2 points by teendifferent 9 hours ago | 1 comment
  • teendifferent 9 hours ago
    OP here. I wanted to test whether the vision encoder's pre-training strategy matters when you stitch it into an LLM. So I froze three encoders (CLIP, I-JEPA, supervised ViT), stitched each into Qwen2.5 with a small trainable projector + LoRA (~3M trainable params), and compared them.
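
    For anyone curious what the stitching looks like, here's a minimal sketch of the recipe, not the repo's exact code; the model names, projector shape, and LoRA settings below are placeholders I picked for illustration:

      import torch
      import torch.nn as nn
      from transformers import CLIPVisionModel, AutoModelForCausalLM
      from peft import LoraConfig, get_peft_model

      # Frozen vision encoder + base LLM; only the projector and LoRA adapters train.
      vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
      vision.requires_grad_(False)
      llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

      # Small projector mapping patch features into the LLM's embedding space.
      projector = nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

      # LoRA on the LLM's attention projections (peft freezes the base weights).
      llm = get_peft_model(llm, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                                           target_modules=["q_proj", "v_proj"]))

      def encode_image(pixel_values):
          # Patch embeddings from the frozen encoder, projected into token space,
          # ready to be prepended to the text embeddings fed to the LLM.
          with torch.no_grad():
              feats = vision(pixel_values=pixel_values).last_hidden_state
          return projector(feats)  # (batch, num_patches, llm_hidden)

      trainable = sum(p.numel() for p in projector.parameters()) \
                + sum(p.numel() for p in llm.parameters() if p.requires_grad)
      print(f"trainable params: {trainable / 1e6:.1f}M")

    The idea is that swapping in I-JEPA or a supervised ViT only changes the encoder and its hidden size; the projector and LoRA setup stay the same.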

    Key findings:
    - CLIP dominates on average (language-aligned embeddings make the projector's job trivial).
    - I-JEPA, which has never seen text during pre-training, ties CLIP on compositional reasoning (CLEVR).
    - Scaling the LLM from 0.5B to 1.5B helped more than swapping any encoder.

    Code, trained weights, and eval scripts are all open: https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM/tree/main

    Blog: https://teendifferent.substack.com/p/stitching-vision-into-l...

    Curious what others think about I-JEPA-style representations for VLMs — the spatial reasoning results surprised me.