6 points by gk1 a day ago | 1 comment
  • halbgut a day ago
    Like any LLM benchmark, LMArena is highly flawed. I do think it has a right to exist. Anecdotally, for me it has been indicative of which LLM's style I like best, not necessarily of factual accuracy. It hasn't, however, been a very useful tool for finding the best LLM for a given job.

    To the article's point though, it's treated as the gold standard, which it isn't. We should have learned that with the sycophancy-gate.

    I'm not sure the methodology here is really sound for the question at hand. It's a bit like saying prediction markets don't work because 40% of the people who voted were wrong.
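
    To make the prediction-market analogy concrete, here is a rough Python sketch. It assumes, purely for illustration, that voters are independent and each one is right 60% of the time (so 40% are wrong on any given vote). The function name `majority_accuracy` and the 60% figure are hypothetical, not taken from the article or from LMArena.

```python
from math import comb


def majority_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of voters picks the right answer.

    Assumption (illustrative only): every voter is independent and is
    right with the same probability p_correct.
    """
    needed = n_voters // 2 + 1  # smallest vote count that is a strict majority
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


if __name__ == "__main__":
    # Each individual voter is wrong 40% of the time in this toy model.
    for n in (1, 11, 101):
        print(n, round(majority_accuracy(n, 0.6), 4))
```

    Under these assumptions the aggregate gets more reliable as votes accumulate, even though the individual error rate never changes. Real votes are not independent and can share a bias (a style preference, for instance), so this only shows that a per-voter error rate by itself says little about the aggregate.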

    You can't really get around running your own benchmarks for the job at hand, if you really want to get 95th-percentile performance on a task.