2 points by jiayaoqijia 5 hours ago | 1 comment
  • jiayaoqijia 5 hours ago

      VibeCodingBench: We benchmarked 15 AI coding models on what developers actually do                                                                      
                                                                                                                                                              
      Current benchmarks have an ecological validity crisis. Models score 70%+ on SWE-bench but struggle in production. Why? They optimize for bug fixes in   
      Python repos—not the auth flows, API integrations, and CRUD dashboards that occupy 80% of real dev work.                                                
                                                                                                                                                              
      So we built VibeCodingBench: 180 tasks across SaaS features, glue code, AI integration, frontend, API integrations, and code evolution.                 
      Multi-dimensional scoring: Functional (40%) + Visual (20%) + Quality (20%) - Cost/Speed penalties. Security gate: Any OWASP Top 10 vuln = automatic 0.  
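
      A minimal sketch of how that composite score might be computed, assuming the weights quoted above; the penalty form, field names, and the note on the weight sum are my own illustration, not the benchmark's actual implementation:

        # Illustrative sketch only -- weights come from the post; the penalty
        # form and names are assumptions, not VibeCodingBench's real code.
        # (Note: the stated weights sum to 0.8; the post doesn't say how any
        # remaining weight is allocated.)
        def composite_score(functional, visual, quality,
                            cost_penalty, speed_penalty,
                            has_owasp_top10_vuln):
            """Sub-scores in [0, 100]; penalties are points deducted."""
            if has_owasp_top10_vuln:
                return 0.0  # security gate: any OWASP Top 10 vuln zeroes the task
            score = 0.40 * functional + 0.20 * visual + 0.20 * quality
            return max(score - (cost_penalty + speed_penalty), 0.0)

        # e.g. a task that passes functionally but renders poorly:
        composite_score(95, 60, 80, cost_penalty=2, speed_penalty=1,
                        has_owasp_top10_vuln=False)  # -> 63.0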
                                                                                                                                                              
      Top 5 Results (Jan 2026):

      1. Claude Opus 4.5 — 89.2% | $12.31 | 44s
      2. Claude Haiku 4.5 — 89.0% | $3.03 | 22s
      3. Grok 4 Fast — 88.8% | $0.21 | 70s
      4. OpenAI GPT-5.2 — 88.8% | $5.01 | 28s
      5. Qwen3 Max — 88.6% | $5.42 | 45s
                                                                                                                                                              
      The real story? Cost varies by nearly 60x between similar performers. Grok 4 Fast matches GPT-5.2 at roughly 1/24th the cost, and Claude Haiku 4.5 delivers near-Opus quality for about $3 total (quick arithmetic below).
                                                                                                                                                              
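      For the cost claims, here is the arithmetic on the per-run totals listed above; the dollar figures are taken verbatim from the table, the ratios are mine:

        # Quick check of the cost-ratio claims using the totals from the table.
        costs = {
            "Claude Opus 4.5": 12.31,
            "Claude Haiku 4.5": 3.03,
            "Grok 4 Fast": 0.21,
            "GPT-5.2": 5.01,
            "Qwen3 Max": 5.42,
        }
        cheapest = costs["Grok 4 Fast"]
        print(round(costs["Claude Opus 4.5"] / cheapest))  # 59: the ~60x spread across the top 5
        print(round(costs["GPT-5.2"] / cheapest))          # 24: GPT-5.2 costs ~24x more per run
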
      Live dashboard: https://vibecoding.llmbench.xyz/
      GitHub repo: https://github.com/alt-research/vibe-coding-benchmark-public
      Thesis: https://github.com/alt-research/vibe-coding-benchmark-public/blob/main/docs/THESIS.md
                                                                                                                                                              
      The ultimate test isn't fixing a bug in scikit-learn. It's shipping a feature your users need—safely, efficiently—before the sprint ends.               
                                                                                                                                                              
      Open source. Contributions welcome.