1 point by a1j9o94 2 hours ago | 1 comment
  • a1j9o94 2 hours ago
    Hey HN,

      I spent the last few weeks exploring whether AI systems could benefit from generating video predictions before making decisions—like how humans mentally simulate "what happens if I pour this coffee?" before acting.
    
      The idea: Show an AI an image, ask "what happens if I push this?", have it generate a video prediction, then compare that prediction to reality. If the prediction looks wrong, maybe the AI could catch its own mistakes.
    
      The result: Current models can't do this. But I learned some interesting things along the way.
    
      What I tested:
      - 7 different architectures for predicting future video frames from VLM latent space
      - Whether perceptual similarity (LPIPS) between predicted and actual video correlates with correctness (sketched after this list)
      - Self-correction loops where the model gets feedback on its predictions
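
      For the LPIPS check, the idea is roughly the sketch below. This is not the repo's code; the lpips package, frame tensors scaled to [-1, 1], and a binary per-clip correctness label are my assumptions:

        import lpips                        # pip install lpips
        import torch
        from scipy.stats import pearsonr

        loss_fn = lpips.LPIPS(net="alex")   # standard AlexNet backbone

        def lpips_distances(pred_frames, real_frames):
            # pred_frames / real_frames: (T, 3, H, W) tensors in [-1, 1].
            # Returns one perceptual distance per frame (lower = more similar).
            with torch.no_grad():
                return loss_fn(pred_frames, real_frames).view(-1)

        def correlation_with_correctness(distances, correct):
            # distances: one LPIPS score per clip; correct: 1 if the model's
            # answer about the outcome was right, else 0. The post reports a
            # correlation of only ~0.1, i.e. no usable error signal.
            return pearsonr(distances.cpu().numpy(), correct)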
    
      Key findings:
    
      1. VLMs can't predict the future – Every architecture I tried performed worse than just copying the current frame as the "prediction" (baseline sketched after this list). The model understands what's in an image but can't predict what will change.
      2. Visual similarity ≠ semantic correctness – This one surprised me. Wrong predictions often looked MORE similar to reality than correct ones (LPIPS correlation: 0.106). You can't use "does it look right?" to catch mistakes.
      3. Some things worked – Hybrid encoders (DINOv2 + VLM) preserve spatial information that VLMs lose. VLMs understand generated video well (93% semantic retention). Small adapters (10M params) work better than large ones (100M).
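
      The copy-the-current-frame baseline from finding 1 is essentially this. A minimal sketch, with per-frame MSE standing in for whichever frame metric you prefer; shapes and names are my assumptions:

        import torch
        import torch.nn.functional as F

        def copy_frame_baseline(current_frame, horizon):
            # "Predict" the next `horizon` frames by repeating the current one.
            # current_frame: (3, H, W) -> returns (horizon, 3, H, W)
            return current_frame.unsqueeze(0).repeat(horizon, 1, 1, 1)

        def frame_error(pred, actual):
            # Per-frame MSE; any pixel or perceptual metric can be dropped in here.
            return F.mse_loss(pred, actual, reduction="none").mean(dim=(1, 2, 3))

        # A learned predictor only clears the bar if its frame_error is
        # consistently below frame_error(copy_frame_baseline(...), actual);
        # none of the 7 architectures above managed that.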
    
      I'm releasing this as a benchmark proposal. Video generation is improving fast—capabilities that don't exist today might emerge in future models. Seems worth tracking.
    
      Links:
      - Demo video: https://youtu.be/YJxDt_zCrUI
      - Code + paper: https://github.com/a1j9o94/foresight
      - Live demo: https://foresight-demo-kappa.vercel.app
    
      Built with Qwen2.5-VL, LTX-Video, Modal (GPUs), and the Something-Something v2 dataset.
    
      Happy to answer questions about the experiments or methodology.
    • seg_lol 2 hours ago
      Why is the demo video not in your readme?
      • a1j9o94 2 hours ago
        Honestly just didn't think about it. Added it.