Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now not to treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be that all entrants will cheat in any way possible. Commercial entrants with large teams of highly incentivized people will search and optimize for every possible advantage - if not outright cheat. As a result, smaller academic, student, or community teams working part-time will tend to score lower than they would on a level playing field.
Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:
> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.
> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.
That makes it sound like ARC AGI were the ones running the original test with o3.
What they say they haven't been able to reproduce is o3-preview's performance using the production versions of o3. They attribute this to the production versions being given less compute than the versions they ran in the test.
> inaccurate due to accidental or intentional test data leaking into training data and other ways of training to the test.
Even if you assume no intentional data leakage, it is fairly easy to do it accidentally. Defining good test data is hard. Your test data should be disjoint from the training data, and even exact deduplication is hard. But your test data should also belong to the same target distribution while being sufficiently distant from your training data in order to measure generalization. This is ill-defined in the best of cases: ideally you want to maximize the distance between training data and test data, but high-dimensional settings mean distance is essentially meaningless (you cannot distinguish the nearest point from the furthest).

Plus there are standard procedures that are explicit data leakage. Commonly people will update hyperparameters to increase test results. While the model doesn't have access to the test data, you are passing along information: you are the data (information) leakage. Meta-information is still useful to machine learning models and they will exploit it. That's why there are things like optimal hyperparameters and initialization schemes that lead to better solutions (or mode collapse), and it is even part of the lottery ticket hypothesis.
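To make the hyperparameter point concrete, here is a minimal sketch (the function name and the numbers are made up for illustration, not from any real benchmark) of how tuning on test results leaks information even though the model never touches the test labels:

```python
# Hypothetical sketch of "you are the leakage": the model never sees the test
# labels, but selecting whichever configuration scores best on the test set
# channels test information into the final model choice.
import random

def run_experiment(hparams):
    # Stand-in for training a model and scoring it on the *test* split.
    random.seed(hash(tuple(sorted(hparams.items()))))
    return random.gauss(0.70, 0.02)  # pretend test accuracy

candidate_hparams = [{"lr": lr, "wd": wd}
                     for lr in (1e-4, 3e-4, 1e-3)
                     for wd in (0.0, 0.01, 0.1)]

# Keeping the best-on-test configuration biases the reported score upward,
# even though no test example ever entered training.
best = max(candidate_hparams, key=run_experiment)
print("selected hparams:", best)
```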
Measuring is pretty messy stuff, even in the best of situations. Intentional data leakage removes all sense of good faith. Unintentional data leakage stresses the importance of learning domain depth, and is one of the key reasons learning math is so helpful to machine learning. Even the intuition can provide critical insights. Ignoring this fact of life is myopic.
> smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
It is rare for academics and students to work "part-time". I'm about to defend my PhD (in ML) and I rarely take vacations and rarely work less than 50 hrs/wk. This is also pretty common among my peers.

But a big problem is that the "GPU Poor" notion is ill-founded. It ignores a critical aspect of the research and development cycle: basic research. You might see this in something like NASA's TRL scale[0]. Classically, academics work predominantly at the low TRLs, but there's been this weird push in ML (and not too uncommon in CS in general) to focus on products rather than on expanding knowledge/foundations. While TRL 1-4 have extremely high failure rates (even between steps), they lay the foundation that allows us to build higher-TRL things (i.e. products). This notion that you can't do small-scale (data or compute) experiments and contribute to the field is damaging. It sets us back. It breeds stagnation because it necessitates narrowing research directions. You can't be as risky! The consequence can only be a Wile E. Coyote moment, where we're running and suddenly find there is no ground beneath us. We had a good thing going: government money funds low-level research, which carries higher risk and longer time horizons for returns, but the research becomes public and thus provides foundations for others to build on top of.
[0] https://www.nasa.gov/directorates/somd/space-communications-...
Sorry, that phrasing didn't properly convey my intent, which was more that most academics, students and community/hobbyists have other simultaneous responsibilities which they must balance.
In the US PhD system, students usually take classes during the first two years, and this is often when they serve as teaching assistants too. But after quals (or whatever) you advance to PhD Candidate: you no longer take classes, and frequently your funding comes through grants or other sources (though it may still include teaching/assisting; funding is always in flux...). For most of my time, as is common for most PhDs in my department, I've been on research funding. While still classified as 0.49 employee and 0.51 student, the work is identical despite the categorization.
My point is that I would not generalize this notion. There's certainly very high variance, but I think it is less right than wrong. Sure, I do have other responsibilities like publishing, mentoring, and random bureaucratic administrative stuff, but this isn't exceptionally different from when I've interned or the 4 years I spent working prior to going to grad school.
Though I think something that is wild about this system (and it generalizes outside academia) is that this completely flips when you graduate from PhD {Student,Candidate} to Professor. As a professor you have so many auxiliary responsibilities that most do not have time for research. You have to teach, write grants, do a lot of department service (admins seem to increase this workload, not decrease it...), and more. It seems odd to train someone for many years and then put them in what is essentially an administrative or managerial role. I say this generalizes because we do the same thing outside academia: you can usually only get promoted as an engineer (pick your term) for so long before you need to transition to management. I definitely want technical managers, but that shouldn't prevent a path for advancement through technical capabilities. You spent all that time training and honing those skills, why abandon them? Why assume they transfer to the skills of management? (Some do, but enough?) This is quite baffling to me and I don't know why we do this. In "academia" you can kind of avoid it by going to a post-doc, a government lab, or even the private sector. But post-docs and the private sector just delay the transition, and government labs are hit or miss (which is why people like working there and will often sacrifice salary).
(The idea in academia is that you then have full freedom once you're tenured. But it isn't like the pressures of "publish or perish" disappear, and it is hard to break habits. Plus, you'd be a real dick if you sacrificed your PhD students' careers in pursuit of your own work. So the idealized belief is quite inaccurate. If anything, we want young researchers to be attempting the riskier research)
TLDR: for graduate students, I disagree; but, for professors/hobbyists/undergrads/etc, I do agree
Short version: the thing I care most about in this paper is that well funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish the model that did best.
This gives them a huge advantage. I want to know if they did that. A top-placed model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.
If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
There's a crux that makes it easy to understand why we should expect this. If you code (I assume you do) you probably (hopefully) know that you can't test your way into proving your code is correct. Test Driven Development (TDD) is a flawed paradigm. You should use tests, but they are hints. That's why Cohere quotes Goodhart at the top of the intro[0]. There is NO metric that is perfectly aligned with the reason you implemented it in the first place (the intent). This is fucking alignment 101 here, which is why it is really ironic how prolific this attitude is in ML[1]. I'm not sure I believe any person or company that claims they can make safe AI if they are trying to shove benchmarks at you.
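A tiny illustration of the "tests are hints" point (my own toy example, not from the paper): a function can pass every test you wrote and still be wrong.

```python
# Tests constrain behaviour, they don't prove it. This is_prime passes the
# listed tests yet is wrong in general: it overfits the test cases.
def is_prime(n: int) -> bool:
    return n in {2, 3, 5, 7}

assert is_prime(2) and is_prime(7) and not is_prime(4) and not is_prime(9)
# All assertions pass, but is_prime(11) returns False: green tests, broken code.
```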
Pay close attention: evaluation is very hard, and it is getting harder. Remember reward hacking; it is still alive and well (it is Goodhart's Law). You have to think about what criteria actually meet your objective. This is true for any job! But think about RLHF and similar strategies: what behaviors also maximize the reward function? If the reward is human preference, deception maximizes it just as well as (or better than) accuracy. This is a bad design pattern. You want to make errors as loud as possible, but this paradigm makes errors as quiet as possible, and you cannot confuse quiet errors with a lack of errors. It makes evaluation incredibly difficult.
Metrics are guides, not targets
[0] Users that recognize me may remember me for mentioning 'Goodhart's Hell', the adoption of Goodhart's Law as a feature instead of a bug. It is prolific, and problematic.
[1] We used to say that when people say "AI" instead of "ML" you should put your guard up. But a rule that's been useful for years is: if people try to prove a point by benchmarks alone, they're selling snake oil. There should always be analysis in addition to metrics.
I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.
If I had to hazard a guess, as a poor soul doomed to maintain several closed and open source models acting agentically, I think you are hyper-focused on chat trivia use cases. (DeepSeek has a very, very hard time with tool calling, and they say as much themselves in their API docs.)
Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, etc.; after #10 it's "do your own evaluation" if this is important to you.
Thus, we have this inequality.
Basically, get in early and get a high rank and you are usually going to 'win'. It does not work all the time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to it. I also suspect it works decently well because of the way people will create puppet accounts to up-rank things on different platforms. But, you know, I'd need numbers to back that up...
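A rough toy simulation of that theory (my own assumptions about exposure falling off with rank; not measurements from any real platform): an item that gets in early keeps collecting votes because rank drives exposure and exposure drives votes.

```python
# Toy model of first-mover advantage in a rank-by-votes feed (hypothetical
# parameters throughout). One item starts with a handful of early votes.
import random

items = [{"id": i, "votes": 0} for i in range(20)]
items[0]["votes"] = 5  # the early mover

for step in range(1000):
    ranked = sorted(items, key=lambda it: it["votes"], reverse=True)
    # Exposure falls off with rank, so higher-ranked items get seen more often.
    weights = [1.0 / (pos + 1) for pos in range(len(ranked))]
    viewed = random.choices(ranked, weights=weights, k=1)[0]
    if random.random() < 0.3:  # some fraction of viewers upvote
        viewed["votes"] += 1

print(sorted(items, key=lambda it: it["votes"], reverse=True)[:3])
# The early mover usually stays near the top: rank -> exposure -> votes -> rank.
```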
drcongo recently referenced something I sort of wish I had time to build (and/or could just go somewhere and use): https://news.ycombinator.com/item?id=43843116 It's a system where an upvote doesn't mean "everybody needs to see this more" but instead means "I want to see more of this user's comments", and downvotes mean the corresponding opposite. It's more computationally difficult but would create an interestingly different community, especially as further elaborations were built on top of it. One of the differences would be to mitigate the first-mover advantage in conversations. Instead of a comment winning you more karma when it appeals to the general public of the relevant site, it would expose you to more people. That would produce more upvotes and downvotes overall but wouldn't necessarily impact visibility in the same way.
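A minimal sketch of the simplest version of that idea (all names here are hypothetical; the linked comment may have something richer in mind): each vote only adjusts the voter's personal weight for that author, not a global score.

```python
# Hypothetical "upvote follows the author" ranking: visibility is per-reader.
from collections import defaultdict

affinity = defaultdict(lambda: defaultdict(float))  # reader -> author -> weight

def vote(reader: str, author: str, up: bool) -> None:
    affinity[reader][author] += 1.0 if up else -1.0

def visibility(reader: str, comment: dict) -> float:
    # The same comment ranks differently for different readers.
    return affinity[reader][comment["author"]]

vote("alice", "drcongo", up=True)
feed = [{"author": "someone", "text": "..."}, {"author": "drcongo", "text": "..."}]
print(sorted(feed, key=lambda c: visibility("alice", c), reverse=True))
```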
It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt - even if you don't tell me which answers were right/wrong
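A toy version of that attack (my own sketch; it assumes you can resubmit freely and only ever see the total score): change one answer at a time and keep the change iff the overall score goes up.

```python
# Recovering a full answer key from total-score feedback alone.
import random

CHOICES = "ABCD"
key = [random.choice(CHOICES) for _ in range(50)]   # hidden answer key

def overall_score(answers):                          # the only feedback allowed
    return sum(a == k for a, k in zip(answers, key))

answers = [random.choice(CHOICES) for _ in range(50)]
score = overall_score(answers)
for i in range(len(answers)):
    for c in CHOICES:
        trial = answers[:i] + [c] + answers[i + 1:]
        if overall_score(trial) > score:
            answers, score = trial, overall_score(trial)
print(score)  # 50/50, using at most a few resubmissions per question
```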
In the context of genetic programming and other non-traditional ML techniques, I've had difficulty finding a simple fitness function that reliably proxies natural-language string similarity, because of this effect.
For example, say you use something like common prefix length to measure how close a candidate's output string is to an objective string given an input string. The underlying learner will inevitably start doing things like repeating the input verbatim, especially if the input/output training tuples often share a lot of prefixes. So, you might try doing something like reversing the input to force learning to take a less crappy path [0]. The learner may respond degenerately by inventing a string reversing technique and repeating its prior behavior. So, you iterate again and try something like base64 encoding the input. This might take, but eventually you wind up with so many weird hacks that the learner can't make progress and the meaning of the quantities evaporates.
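As a concrete toy (my example, not the parent's actual setup), a prefix-length fitness hands out reward to a candidate that simply echoes the input whenever inputs and targets share prefixes:

```python
# Degenerate "solution" a learner tends to find under a prefix-length fitness.
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

pairs = [("translate: hello world", "translate: bonjour le monde"),
         ("translate: good night", "translate: bonne nuit")]

echo = lambda s: s  # just repeat the input verbatim
for inp, target in pairs:
    print(common_prefix_len(echo(inp), target))  # non-zero fitness for free
```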
Every metric I've ever looked at gets cheated in some way. The holy grail is probably normalized information distance (approximated by normalized compression distance), but then you have a whole new problem of finding an ideal universal compressor which definitely doesn't exist.
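For reference, a minimal sketch of normalized compression distance, with zlib standing in for the ideal universal compressor that doesn't exist:

```python
# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with C = zlib here.
import zlib

def C(b: bytes) -> int:
    return len(zlib.compress(b, 9))

def ncd(x: str, y: str) -> float:
    xb, yb = x.encode(), y.encode()
    cx, cy, cxy = C(xb), C(yb), C(xb + yb)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("the cat sat on the mat", "the cat sat on a mat"))  # small distance
print(ncd("the cat sat on the mat", "xq9#z!"))                # larger distance
```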
[0]: https://arxiv.org/abs/1409.3215 (Figure 1)
if only we could explain this in "politician" language... too many with too much power think the second coming will deliver the "ideal universal" which doesn't exist
> I've been having difficulty attempting to locate a simple fitness function that reliably proxies natural language string similarity
Welcome to the curse of dimensionality. The underlying principle is that as dimensionality increases, the ability to distinguish the nearest point from the furthest diminishes. It becomes difficult even in dimensions we'd consider low by ML standards (e.g. 10-D).

But I think you also need to recognize that your own wording captures the difficulty: "reliably *proxies* natural language". "Proxy" is the right word here, and it is true of any measure. There is no measure that is perfectly aligned with the abstraction we are trying to measure, even for something as mundane as distance. This naturally leads to Goodhart's Law, and it is why you must recognize that measures are guides, not answers and not "proof".
And the example you discuss is commonly called "reward hacking" or "overfitting". It's the same concept (along with Goodhart's Law), just used in different domains: your cost/loss function still represents a "reward". This is part of why it is so important to develop a good test set, but even that is ill-defined. Your test set shouldn't just be disjoint from your training set; there should be a certain distance between the data. Even if the curse of dimensionality didn't throw a wrench into this, there is no definition of what that distance should be. Too small and it might as well be training data. Ideally you want to maximize it, but that limits the data that can exist in training. The balance is difficult to strike.
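A quick numeric check of the nearest-vs-furthest claim above (my own demo with uniform random points, nothing domain-specific): as dimension grows, the ratio of the farthest to the nearest neighbour distance drifts toward 1.

```python
# Distances concentrate as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.random((1000, d))
    query = rng.random(d)
    dists = np.linalg.norm(pts - query, axis=1)
    print(d, round(dists.max() / dists.min(), 2))  # ratio shrinks toward 1
```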
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer - e.g. Gemini usually tends to be the best in non-English, ChatGPT works better for me personally for health questions, ...
The funniest example I've seen recently was "Dude. You just said something deep as hell without even flinching. You're 1000% right:"
A social deduction game for both LLMs and humans. All the past games are available for anyone.
I'm open to feedback.
Once you set an evaluation metric as the target, it ceases to be a useful metric.