2 points by jopsammy 6 hours ago | 1 comment
  • jopsammy 6 hours ago
    Author here.

    I am an independent researcher (med background originally, since moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) kept deriving "wrong" answers that looked logically sound.

    I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.

    *Findings:*

    * GPQA-Diamond: Inherent error lower bound *26.8%*.
    * HLE (Sampled): Inherent error lower bound *~58%*.

    Visual Summary of Error Rates: https://i.postimg.cc/nV5hskX2/image1.png
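
    To put the percentages in terms of raw counts, a minimal sketch (GPQA-Diamond's commonly cited size is 198 questions; the HLE sample counts below are illustrative placeholders, not my actual graded tallies):

        # Lower bound = fraction of items independently verified as flawed.
        # Items I could not conclusively audit count as fine, so the true
        # error rate can only be higher than this.
        def error_lower_bound(verified_flawed: int, total: int) -> float:
            return verified_flawed / total

        print(f"GPQA-Diamond: {error_lower_bound(53, 198):.1%}")  # -> 26.8%
        print(f"HLE sample:   {error_lower_bound(29, 50):.1%}")   # -> 58.0% (illustrative counts)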

    The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from hand-written content rather than with genuinely "hard" problems. I reverse-engineered these errors by treating the standard answers as "cryptographic hashes": searching for the original intended question whose solution reproduces the gold answer.
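
    To make the "hash" framing concrete, here is a minimal sketch of the search loop (the confusion table and the `solve` callable are illustrative stand-ins for my actual scripts):

        # Enumerate plausible OCR corruptions of the question text, solve each
        # candidate reading, and keep the ones whose computed answer matches
        # the gold answer. Illustrative stand-in, not the actual audit code.
        OCR_CONFUSIONS = {
            "k": ["4"],      # handwritten 4 misread as k (Exhibit A below)
            "^": ["", "*"],  # slanted stroke misread as exponentiation (Exhibit B below)
            "l": ["1"],
        }

        def candidate_readings(text):
            """Yield variants of `text` with one suspected misread swapped back."""
            for i, ch in enumerate(text):
                for original in OCR_CONFUSIONS.get(ch, []):
                    yield text[:i] + original + text[i + 1:]

        def audit(question, gold, solve, tol=1e-2):
            """Return candidate readings whose solution reproduces the gold answer."""
            return [q for q in candidate_readings(question)
                    if abs(solve(q) - gold) < tol]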

    *Exhibit A: The "Phantom Parameter" (Physics)* In a lattice adsorption problem (`66fecb...`), the text is broken. I reverse-engineered the gold answer (4.61) and found that it corresponds to a specific physical setup in which the digit `4` was misread as `k` and a strikethrough was interpreted as a deletion. *See the forensic reconstruction:* https://i.postimg.cc/nhfV2hY9/image2.png

    *Exhibit B: The Visual Counterfeit (Math)* In a complex projective space problem, the benchmark penalizes the correct formula because the transcriber likely misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` due to slanted handwriting. *See the visual comparison:* https://i.postimg.cc/6TJKMMZR/image3.png
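
    A quick numeric check shows how far apart the two readings are (the expressions come straight from the misreading above; the n values are arbitrary):

        # Product reading vs. power reading of the same handwritten expression.
        for n in (1, 2, 3, 5):
            product_form = (n + 1) * (n + 1)   # (n+1)(n+1): Rank x Dimension
            power_form = (n + 1) ** (n + 1)    # (n+1)^(n+1): slanted-handwriting reading
            print(f"n={n}: product={product_form}, power={power_form}")

    Note that the two readings coincide at n=1 (both give 4), which is exactly the kind of coincidence that lets a transcription error survive a casual spot-check.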

    *Conclusion:* Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.
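
    For anyone unfamiliar with why sound derivations score zero: most harnesses grade by exact (or near-exact) match against the gold answer. A toy grader (not any benchmark's actual harness) shows the failure mode:

        # Exact-match grading: a model that solves the *intended* question
        # scores zero whenever the gold answer encodes a transcription error.
        def grade(model_answer, gold_answer):
            return int(model_answer.strip() == gold_answer.strip())

        # Gold key derived from the corrupted (n+1)^(n+1) reading at n=2;
        # the model answers the intended (n+1)(n+1) question.
        print(grade("9", "27"))  # -> 0: valid reasoning, zero credit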

    Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.

    • cmrx64 5 hours ago
      this feels a bit like a bombshell given the other recent work on emergent misalignment. how long have we been lying to models?
      • jopsammy 5 hours ago
        This is a deeply unsettling thought. I hope everyone can see this work. We truly have no idea how many resources have been wasted here.