To the article's point, though: it's treated as the gold standard, and it isn't. We should have learned that from sycophancy-gate.
I'm not sure the methodology here is really sound for the question at hand. It's a bit like saying prediction markets don't work because 40% of the people who voted were wrong.
You can't really get around running your own benchmarks for the job at hand if you actually want 95th-percentile performance on that task.
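For what it's worth, "run your own benchmarks" can start as small as a script like this. A minimal sketch, where everything (`run_model`, the cases) is a hypothetical stand-in for your actual model call and your actual job:

```python
# Minimal task-specific eval harness (all names are hypothetical stand-ins).
def run_model(prompt: str) -> str:
    # Replace with a real model call (API client, local model, etc.).
    # Here it's a dummy that just normalizes the input, so the script runs.
    return prompt.strip().upper()

# Hand-written cases that reflect *your* job, not a public leaderboard's.
CASES = [
    ("hello", "HELLO"),
    ("  spaces  ", "SPACES"),
]

def pass_rate(cases) -> float:
    # Exact-match grading; swap in whatever grading fits your task.
    passed = sum(run_model(inp) == want for inp, want in cases)
    return passed / len(cases)

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(CASES):.0%}")
```

Twenty cases you wrote yourself will tell you more about your task than any leaderboard position, and you can rerun it every time a new model drops.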