Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now not to treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be that all entrants will cheat in any way possible. Commercial entrants with large teams of highly incentivized people will search and optimize for every possible advantage - if not outright cheat. As a result, smaller academic, student, or community teams working part-time will tend to score lower than they would on a level playing field.
Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:
> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.
> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.
That makes it sound like ARC AGI were the ones running the original test with o3.
What they say they haven't been able to reproduce is o3-preview's performance using the production versions of o3. They attribute this to the production versions being given less compute than the versions they ran in the test.
> inaccurate due to accidental or intentional test data leaking into training data and other ways of training to the test.
Even if you assume no intentional data leakage, it is fairly easy to do it accidentally. Defining good test data is hard. Your test data should be disjoint from the training data, and even exact deduplication is hard. But your test data should also belong to the same target distribution while being sufficiently distant from your training data in order to measure generalization. This is ill-defined in the best of cases: ideally you want to maximize the distance between training data and test data, but high-dimensional settings mean distance is essentially meaningless (you cannot distinguish the nearest point from the furthest).

Plus there are standard procedures that are explicit data leakage. Commonly people will update hyperparameters to increase test results. While the model doesn't have access to the test data, you are passing along information: you are the data (information) leakage. Meta-information is still useful to machine learning models and they will exploit it. That's why there are things like optimal hyperparameters and initialization schemes that lead to better solutions (or mode collapse), and it is even part of the lottery ticket hypothesis.
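To make the hyperparameter point concrete, here is a minimal sketch (the function name and the numbers are made up for illustration, not from any real benchmark) of how tuning on test results leaks information even though the model never touches the test labels:

```python
# Hypothetical sketch of "you are the leakage": the model never sees the test
# labels, but selecting whichever configuration scores best on the test set
# channels test information into the final model choice.
import random

def run_experiment(hparams):
    # Stand-in for training a model and scoring it on the *test* split.
    random.seed(hash(tuple(sorted(hparams.items()))))
    return random.gauss(0.70, 0.02)  # pretend test accuracy

candidate_hparams = [{"lr": lr, "wd": wd}
                     for lr in (1e-4, 3e-4, 1e-3)
                     for wd in (0.0, 0.01, 0.1)]

# Keeping the best-on-test configuration biases the reported score upward,
# even though no test example ever entered training.
best = max(candidate_hparams, key=run_experiment)
print("selected hparams:", best)
```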
Measuring is pretty messy stuff, even in the best of situations. Intentional data leakage removes all sense of good faith. Unintentional data leakage stresses the importance of learning domain depth, and is one of the key reasons learning math is so helpful to machine learning. Even the intuition can provide critical insights. Ignoring this fact of life is myopic.
> smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
It is rare for academics and students to work "part-time". I'm about to defend my PhD (in ML) and I rarely take vacations and rarely work less than 50 hrs/wk. This is also pretty common among my peers.

But a big problem is that the "GPU Poor" notion is ill-founded. It ignores a critical aspect of the research and development cycle: basic research. You might see this in something like NASA's TRL scale[0]. Classically, academics work predominantly at the low TRLs, but there's been this weird push in ML (and not too uncommon in CS in general) to focus on products rather than on expanding knowledge/foundations. While TRL 1-4 have extremely high failure rates (even between steps), they lay the foundation that allows us to build higher-TRL things (i.e. products). This notion that you can't do small-scale (data or compute) experiments and contribute to the field is damaging. It sets us back. It breeds stagnation because it necessitates narrowing research directions. You can't be as risky! The consequence can only be a Wile E. Coyote moment, where we're running and suddenly find there is no ground beneath us. We had a good thing going: government money funds low-level research, which carries higher risk and longer time horizons for returns, but the research becomes public and thus provides foundations for others to build on top of.
[0] https://www.nasa.gov/directorates/somd/space-communications-...
Sorry, that phrasing didn't properly convey my intent, which was more that most academics, students and community/hobbyists have other simultaneous responsibilities which they must balance.
In the US PhD system, students usually take classes during the first two years, and this is often when they serve as teaching assistants too. But after quals (or whatever) you advance to PhD Candidate: you no longer take classes, and frequently your funding comes through grants or other sources (though it may still include teaching/assisting; funding is always in flux...). For most of my time, as is common for most PhDs in my department, I've been on research funding. While still classified as 0.49 employee and 0.51 student, the work is identical despite the categorization.
My point is that I would not generalize this notion. There's certainly very high variance, but I think it is less right than wrong. Sure, I do have other responsibilities like publishing, mentoring, and random bureaucratic administrative stuff, but this isn't exceptionally different from when I've interned or the 4 years I spent working prior to going to grad school.
Though I think something that is wild about this system (and it generalizes outside academia) is that this completely flips when you graduate from PhD {Student,Candidate} to Professor. As a professor you have so many auxiliary responsibilities that most do not have time for research. You have to teach, write grants, do a lot of department service (admins seem to increase this workload, not decrease it...), and more. It seems odd to train someone for many years and then put them in what is essentially an administrative or managerial role. I say this generalizes because we do the same thing outside academia: you can usually only get promoted as an engineer (pick your term) for so long before you need to transition to management. I definitely want technical managers, but that shouldn't prevent a path for advancement through technical capabilities. You spent all that time training and honing those skills, why abandon them? Why assume they transfer to the skills of management? (Some do, but enough?) This is quite baffling to me and I don't know why we do this. In "academia" you can kind of avoid it by going to a post-doc, a government lab, or even the private sector. But post-docs and the private sector just delay the transition, and government labs are hit or miss (which is why people like working there and will often sacrifice salary).
(The idea in academia is that you then have full freedom once you're tenured. But it isn't like the pressures of "publish or perish" disappear, and it is hard to break habits. Plus, you'd be a real dick if you sacrificed your PhD students' careers in pursuit of your own work. So the idealized belief is quite inaccurate. If anything, we want young researchers to be attempting the riskier research)
TLDR: for graduate students, I disagree; but, for professors/hobbyists/undergrads/etc, I do agree
Short version: the thing I care most about in this paper is that well funded vendors can apparently submit dozens of variations of their models to the leaderboard and then selectively publish the model that did best.
This gives them a huge advantage. I want to know if they did that. A top-placed model with a footnote saying "they tried 22 variants, most of which scored lower than this one" helps me understand what's going on.
If the top model tried 22 times and scored lower on 21 of those tries, whereas the model in second place only tried once, I'd like to hear about it.
There's a crux that makes it easy to understand why we should expect this. If you code (I assume you do) you probably (hopefully) know that you can't test your way into proving your code is correct. Test Driven Development (TDD) is a flawed paradigm. You should use tests, but they are hints. That's why Cohere quotes Goodhart at the top of the intro[0]. There is NO metric that is perfectly aligned with the reason you implemented it in the first place (the intent). This is fucking alignment 101 here, which is why it is really ironic how prolific this attitude is in ML[1]. I'm not sure I believe any person or company that claims they can make safe AI if they are trying to shove benchmarks at you.
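A tiny illustration of the "tests are hints" point (my own toy example, not from the paper): a function can pass every test you wrote and still be wrong.

```python
# Tests constrain behaviour, they don't prove it. This is_prime passes the
# listed tests yet is wrong in general: it overfits the test cases.
def is_prime(n: int) -> bool:
    return n in {2, 3, 5, 7}

assert is_prime(2) and is_prime(7) and not is_prime(4) and not is_prime(9)
# All assertions pass, but is_prime(11) returns False: green tests, broken code.
```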
Pay close attention: evaluation is very hard, and it is getting harder. Remember reward hacking; it is still alive and well (it is Goodhart's Law). You have to think about what criteria actually meet your objective. This is true for any job! But think about RLHF and similar strategies: what behaviors also maximize the reward function? If the reward is human preference, deception maximizes it just as well as (or better than) accuracy. This is a bad design pattern. You want to make errors as loud as possible, but this paradigm makes errors as quiet as possible, and you cannot confuse quiet errors with a lack of errors. It makes evaluation incredibly difficult.
Metrics are guides, not targets
[0] Users that recognize me may remember me for mentioning 'Goodhart's Hell', the adoption of Goodhart's Law as a feature instead of a bug. It is prolific, and problematic.
[1] We used to say that when people say "AI" instead of "ML" you should put your guard up. But a rule that's been useful for years is: if people try to prove a point by benchmarks alone, they're selling snake oil. There should always be analysis in addition to metrics.
I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.
If I had to hazard a guess, as a poor soul doomed to maintain several closed and open source models acting agentically, I think you are hyper-focused on chat trivia use cases. (DeepSeek has a very, very hard time with tool calling, and they say as much themselves in their API docs.)
Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, etc.; after #10 it's "do your own evaluation" if this is important to you.
Thus, we have this inequality.
Basically, get in early and get a high rank and you are usually going to 'win'. It does not work all the time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to it. I also suspect it works decently well because of the way people will create puppet accounts to up-rank things on different platforms. But, you know, I'd need numbers to back that up...
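A rough toy simulation of that theory (my own assumptions about exposure falling off with rank; not measurements from any real platform): an item that gets in early keeps collecting votes because rank drives exposure and exposure drives votes.

```python
# Toy model of first-mover advantage in a rank-by-votes feed (hypothetical
# parameters throughout). One item starts with a handful of early votes.
import random

items = [{"id": i, "votes": 0} for i in range(20)]
items[0]["votes"] = 5  # the early mover

for step in range(1000):
    ranked = sorted(items, key=lambda it: it["votes"], reverse=True)
    # Exposure falls off with rank, so higher-ranked items get seen more often.
    weights = [1.0 / (pos + 1) for pos in range(len(ranked))]
    viewed = random.choices(ranked, weights=weights, k=1)[0]
    if random.random() < 0.3:  # some fraction of viewers upvote
        viewed["votes"] += 1

print(sorted(items, key=lambda it: it["votes"], reverse=True)[:3])
# The early mover usually stays near the top: rank -> exposure -> votes -> rank.
```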
drcongo recently referenced something I sort of wish I had time to build (and/or could just go somewhere and use): https://news.ycombinator.com/item?id=43843116 It's a system where an upvote doesn't mean "everybody needs to see this more" but instead means "I want to see more of this user's comments", and downvotes mean the corresponding opposite. It's more computationally difficult but would create an interestingly different community, especially as further elaborations were built on top of it. One of the differences would be to mitigate the first-mover advantage in conversations. Instead of a comment winning you more karma when it appeals to the general public of the relevant site, it would expose you to more people. That would produce more upvotes and downvotes overall but wouldn't necessarily impact visibility in the same way.
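A minimal sketch of the simplest version of that idea (all names here are hypothetical; the linked comment may have something richer in mind): each vote only adjusts the voter's personal weight for that author, not a global score.

```python
# Hypothetical "upvote follows the author" ranking: visibility is per-reader.
from collections import defaultdict

affinity = defaultdict(lambda: defaultdict(float))  # reader -> author -> weight

def vote(reader: str, author: str, up: bool) -> None:
    affinity[reader][author] += 1.0 if up else -1.0

def visibility(reader: str, comment: dict) -> float:
    # The same comment ranks differently for different readers.
    return affinity[reader][comment["author"]]

vote("alice", "drcongo", up=True)
feed = [{"author": "someone", "text": "..."}, {"author": "drcongo", "text": "..."}]
print(sorted(feed, key=lambda c: visibility("alice", c), reverse=True))
```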
It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt - even if you don't tell me which answers were right/wrong
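A toy version of that attack (my own sketch; it assumes you can resubmit freely and only ever see the total score): change one answer at a time and keep the change iff the overall score goes up.

```python
# Recovering a full answer key from total-score feedback alone.
import random

CHOICES = "ABCD"
key = [random.choice(CHOICES) for _ in range(50)]   # hidden answer key

def overall_score(answers):                          # the only feedback allowed
    return sum(a == k for a, k in zip(answers, key))

answers = [random.choice(CHOICES) for _ in range(50)]
score = overall_score(answers)
for i in range(len(answers)):
    for c in CHOICES:
        trial = answers[:i] + [c] + answers[i + 1:]
        if overall_score(trial) > score:
            answers, score = trial, overall_score(trial)
print(score)  # 50/50, using at most a few resubmissions per question
```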
In the context of genetic programming and other non-traditional ML techniques, I've had difficulty finding a simple fitness function that reliably proxies natural-language string similarity, because of this effect.
For example, say you use something like common prefix length to measure how close a candidate's output string is to an objective string given an input string. The underlying learner will inevitably start doing things like repeating the input verbatim, especially if the input/output training tuples often share a lot of prefixes. So, you might try doing something like reversing the input to force learning to take a less crappy path [0]. The learner may respond degenerately by inventing a string reversing technique and repeating its prior behavior. So, you iterate again and try something like base64 encoding the input. This might take, but eventually you wind up with so many weird hacks that the learner can't make progress and the meaning of the quantities evaporates.
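As a concrete toy (my example, not the parent's actual setup), a prefix-length fitness hands out reward to a candidate that simply echoes the input whenever inputs and targets share prefixes:

```python
# Degenerate "solution" a learner tends to find under a prefix-length fitness.
def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

pairs = [("translate: hello world", "translate: bonjour le monde"),
         ("translate: good night", "translate: bonne nuit")]

echo = lambda s: s  # just repeat the input verbatim
for inp, target in pairs:
    print(common_prefix_len(echo(inp), target))  # non-zero fitness for free
```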
Every metric I've ever looked at gets cheated in some way. The holy grail is probably normalized information distance (approximated by normalized compression distance), but then you have a whole new problem of finding an ideal universal compressor which definitely doesn't exist.
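For reference, a minimal sketch of normalized compression distance, with zlib standing in for the ideal universal compressor that doesn't exist:

```python
# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with C = zlib here.
import zlib

def C(b: bytes) -> int:
    return len(zlib.compress(b, 9))

def ncd(x: str, y: str) -> float:
    xb, yb = x.encode(), y.encode()
    cx, cy, cxy = C(xb), C(yb), C(xb + yb)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("the cat sat on the mat", "the cat sat on a mat"))  # small distance
print(ncd("the cat sat on the mat", "xq9#z!"))                # larger distance
```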
[0]: https://arxiv.org/abs/1409.3215 (Figure 1)
if only we could explain this in "politician" language... too many with too much power think the second coming will deliver the "ideal universal" which doesn't exist
> I've been having difficulty attempting to locate a simple fitness function that reliably proxies natural language string similarity
Welcome to the curse of dimensionality. The underlying principle is that as dimensionality increases, the ability to distinguish the nearest point from the furthest diminishes. It becomes difficult even in dimensions we'd consider low by ML standards (e.g. 10-D).

But I think you also need to recognize that your own wording captures the difficulty: "reliably *proxies* natural language". "Proxy" is the right word here, and it is true of any measure. There is no measure that is perfectly aligned with the abstraction we are trying to measure, even for something as mundane as distance. This naturally leads to Goodhart's Law, and it is why you must recognize that measures are guides, not answers and not "proof".
And the example you discuss is commonly called "reward hacking" or "overfitting". It's the same concept (along with Goodhart's Law), just used in different domains: your cost/loss function still represents a "reward". This is part of why it is so important to develop a good test set, but even that is ill-defined. Your test set shouldn't just be disjoint from your training set; there should be a certain distance between the data. Even if the curse of dimensionality didn't throw a wrench into this, there is no definition of what that distance should be. Too small and it might as well be training data. Ideally you want to maximize it, but that limits the data that can exist in training. The balance is difficult to strike.
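A quick numeric check of the nearest-vs-furthest claim above (my own demo with uniform random points, nothing domain-specific): as dimension grows, the ratio of the farthest to the nearest neighbour distance drifts toward 1.

```python
# Distances concentrate as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.random((1000, d))
    query = rng.random(d)
    dists = np.linalg.norm(pts - query, axis=1)
    print(d, round(dists.max() / dists.min(), 2))  # ratio shrinks toward 1
```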
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer - e.g. Gemini usually tends to be the best in non-English, ChatGPT works better for me personally for health questions, ...
The funniest example I've seen recently was "Dude. You just said something deep as hell without even flinching. You're 1000% right:"
A social deduction game for both LLMs and humans. All the past games are available for anyone.
I'm open to feedback.
Once you set an evaluation metric as the target, it ceases to be a useful metric.