VibeCodingBench: We benchmarked 15 AI coding models on what developers actually do
Current benchmarks have an ecological validity crisis. Models score 70%+ on SWE-bench but struggle in production. Why? They optimize for bug fixes in Python repos, not the auth flows, API integrations, and CRUD dashboards that occupy 80% of real dev work.
So we built VibeCodingBench: 180 tasks across SaaS features, glue code, AI integration, frontend, API integrations, and code evolution.
Multi-dimensional scoring: Functional (40%) + Visual (20%) + Quality (20%), minus Cost/Speed penalties. Security gate: any OWASP Top 10 vuln = automatic 0.
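Roughly, a per-task score works like this (a minimal sketch based only on the weights stated above; the penalty handling, normalization, and function name are illustrative assumptions, not the benchmark's actual code):

```python
def composite_score(functional: float, visual: float, quality: float,
                    cost_penalty: float, speed_penalty: float,
                    has_owasp_top10_vuln: bool) -> float:
    """Illustrative per-task score. Sub-scores are assumed to be in [0, 1];
    penalties are assumed to subtract directly from the weighted sum."""
    # Security gate: any OWASP Top 10 vulnerability zeroes the task outright.
    if has_owasp_top10_vuln:
        return 0.0
    weighted = 0.40 * functional + 0.20 * visual + 0.20 * quality
    return max(0.0, weighted - cost_penalty - speed_penalty)
```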
Top 5 Results (Jan 2026):
1⃣ Claude Opus 4.5 — 89.2% | $12.31 | 44s
2⃣ Claude Haiku 4.5 — 89.0% | $3.03 | 22s
3⃣ Grok 4 Fast — 88.8% | $0.21 | 70s
4⃣ OpenAI GPT-5.2 — 88.8% | $5.01 | 28s
5⃣ Qwen3 Max — 88.6% | $5.42 | 45s
The real story? Cost varies 60x between similar performers. Grok 4 Fast matches GPT-5.2 at 1/25th the cost. Claude Haiku 4.5 delivers near-Opus quality for $3 total.
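To make that spread concrete, here's a quick back-of-the-envelope using the leaderboard figures above (treating the dollar column as total benchmark cost):

```python
# Score per benchmark dollar for the top five, from the numbers above.
leaderboard = [
    ("Claude Opus 4.5", 89.2, 12.31),
    ("Claude Haiku 4.5", 89.0, 3.03),
    ("Grok 4 Fast", 88.8, 0.21),
    ("GPT-5.2", 88.8, 5.01),
    ("Qwen3 Max", 88.6, 5.42),
]

for name, score, cost in sorted(leaderboard, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{name:17s} {score / cost:7.1f} score points per $")

# Grok 4 Fast comes out around 423 points/$ vs. roughly 7 points/$ for Opus 4.5:
# that's the ~60x cost spread at essentially the same accuracy.
```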
Live dashboard: https://vibecoding.llmbench.xyz/
GitHub repo: https://github.com/alt-research/vibe-coding-benchmark-public
Thesis: https://github.com/alt-research/vibe-coding-benchmark-public/blob/main/docs/THESIS.md
The ultimate test isn't fixing a bug in scikit-learn. It's shipping a feature your users need—safely, efficiently—before the sprint ends.
Open source. Contributions welcome.