2 points by camillemolas 3 hours ago | 6 comments
  • camillemolas 3 hours ago
    I built this: rusmarterthananllm.com

    Domain experts (doctors, lawyers, engineers) submit questions from their fields that probe where frontier AI actually fails. Claude, GPT, and Gemini all attempt each question simultaneously. Experts flag errors with professional reasoning, and other credentialed professionals in the same domain verify them.

    AI benchmark performance has decoupled from real-world professional capability. Models score at or near ceiling on standard evaluations while still failing in ways that domain professionals catch immediately. The benchmarks that exist are either saturated, constructed by the labs themselves, or simply don't capture the judgment that comes from years of field experience.

    What's missing is a benchmark built by the people whose expertise is actually at stake. Professionals motivated to find failures, not validate models. Every verified failure becomes a permanent data point. The benchmark compounds continuously and can't be reverse-engineered because the questions come from human judgment, not datasets.

    This extends to multimodal inputs. A radiologist can submit an X-ray. A cardiologist can upload a heart sound. A structural engineer can attach a blueprint. The same adversarial evaluation across text, image, audio, and documents in the domains where multimodal model failures matter most.

    The downstream goal is a verified record of where frontier AI breaks across professional domains. Useful for labs evaluating models, researchers studying capability gaps, and professionals who need to know where to trust AI and where not to.

    Early domains: medicine, law, finance, engineering, coding, trades

    Would love domain experts to throw their hardest questions at it. What breaks in your field?

  • camillemolas 2 hours ago
    We’re also very interested in multimodal. Do you work with pictures, recordings, videos, or anything along those lines in your domain? We want to find out if models can fail on those as well!
  • vrajshroff 2 hours ago
    Oh wow! Super interesting. Let me try asking about antioxidants and oxidative stress. I feel like it’s niche enough that it might just work haha
    • camillemolas 2 hours ago
      If it fails let me know!! That’s exactly what we are looking for.
  • diegovergara47 3 hours ago
    This is interesting. I work in private equity secondaries, I wonder if I can beat the LLM. How is the data I generate helpful and is the plan to eventually pay users like me?
    • camillemolas 3 hours ago
      Yes, private equity secondaries is a great domain for this. The valuation edge cases and LP agreement interpretation are exactly where frontier models fail confidently. The data becomes part of a verified record of AI capability gaps and is valuable to labs and enterprises building finance AI.

      Payment is coming. Right now we’re building the expert network, and verified failures will be compensated monetarily. Would love to have you as an early finance expert; throw your hardest question at it.

  • jasonkim-io 2 hours ago
    Interesting stuff! Will check out
    • camillemolas 2 hours ago
      Thanks! Hopefully you get to beat it and get paid out $$, but also bragging rights!
  • caillahmolas 3 hours ago
    Nice