Domain experts (doctors, lawyers, engineers) submit questions from their field that probe where frontier AI actually fails. Claude, GPT, and Gemini each attempt every question. The submitting expert flags errors with professional reasoning, and other credentialed professionals in the same domain verify them.
AI benchmark performance has decoupled from real-world professional capability. Models score at or near ceiling on standard evaluations while still failing in ways that domain professionals catch immediately. Existing benchmarks are saturated, constructed by the labs themselves, or simply fail to capture the judgment that comes from years of field experience.
What's missing is a benchmark built by the people whose expertise is actually at stake. Professionals motivated to find failures, not validate models. Every verified failure becomes a permanent data point. The benchmark compounds continuously and can't be reverse-engineered because the questions come from human judgment, not datasets.
This extends to multimodal inputs. A radiologist can submit an X-ray. A cardiologist can upload a heart sound. A structural engineer can attach a blueprint. The same adversarial evaluation across text, image, audio, and documents in the domains where multimodal model failures matter most.
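To make the workflow concrete, here is a minimal sketch of what a single verified-failure record could look like. The field names, the two-reviewer quorum, and the schema itself are illustrative assumptions, not the platform's actual data model.

```python
# Hypothetical sketch of a verified-failure record; names and fields are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attachment:
    """Multimodal input attached to a question, e.g. an X-ray or a heart sound."""
    kind: str   # "image", "audio", "document", ...
    uri: str    # where the file lives


@dataclass
class ModelAttempt:
    """One frontier model's answer to the expert's question."""
    model: str  # e.g. "claude", "gpt", "gemini"
    answer: str


@dataclass
class Verification:
    """Independent review by another credentialed professional in the same domain."""
    verifier_id: str
    agrees_failure: bool
    notes: str = ""


@dataclass
class FailureRecord:
    """One data point: the question, model attempts, the flagged error, and its verifications."""
    domain: str                  # "medicine", "law", "finance", ...
    expert_id: str
    question: str
    attachments: List[Attachment] = field(default_factory=list)
    attempts: List[ModelAttempt] = field(default_factory=list)
    flagged_error: Optional[str] = None   # the expert's professional reasoning
    verifications: List[Verification] = field(default_factory=list)

    def is_verified(self, quorum: int = 2) -> bool:
        """A failure counts once enough same-domain professionals confirm it (quorum is assumed)."""
        return sum(v.agrees_failure for v in self.verifications) >= quorum
```

The design choice the sketch illustrates: the question, the models' answers, the expert's reasoning, and the independent verifications all live in one record, so each confirmed failure stands on its own as a permanent, auditable data point.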
The downstream goal is a verified record of where frontier AI breaks across professional domains. Useful for labs evaluating models, researchers studying capability gaps, and professionals who need to know where to trust AI and where not to.
Early domains: medicine, law, finance, engineering, coding, trades
We'd love domain experts to throw their hardest questions at it. What breaks in your field?
Payment is coming. Right now we're building the expert network, and verified failures will be compensated. We'd love to have you as an early finance expert; throw your hardest question at it.