To demonstrate this you'd measure the compute/cost of running the models and of human-verifying the output.
The statistics provided don't exclude the possibility that, instead of giving the top 5 models each a single opportunity to propose a solution, it may be more efficient to give all 5 opportunities to the best-scoring model:
At a 24% win rate, the null hypothesis (what a researcher ought to predict based on common sense) is that the probability of a loss is 76%, the probability of losing N times in a row is 0.76^N, and so the probability of at least one win in N attempts is 1 - 0.76^N (the arithmetic is sketched below).
So for consulting the best-scoring model twice (2x top-1) I would expect 42.24%, better than giving the 2 top-scoring models a single try each (1x top-2), which resulted in 35%.
Same for 3x top-1 vs 1x top-3: 56.10% vs 51%
Same for 4x top-1 vs 1x top-4: 66.64% vs 66%
Same for 5x top-1 vs 1x top-5: 74.64% vs 73%
Same for 6x top-1 vs 1x top-6: 80.73% vs 83%
Same for 7x top-1 vs 1x top-7: 85.35% vs 90%
Same for 8x top-1 vs 1x top-8: 88.87% vs 95%
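For concreteness, here's the back-of-the-envelope calculation in Python. The 1x top-N percentages are the ones I read off the post, so treat them as approximate:

```python
# Back-of-the-envelope: P(at least one win in N tries) for the top model
# vs. the reported win rate of letting the top N models try once each.
p_loss = 0.76  # top-1 model loses 76% of the time (24% win rate)

# Reported 1x top-N win rates, as read off the post (approximate).
reported_top_n = {2: 0.35, 3: 0.51, 4: 0.66, 5: 0.73, 6: 0.83, 7: 0.90, 8: 0.95}

for n in range(2, 9):
    repeated_top_1 = 1 - p_loss ** n      # N independent tries of the top model
    ensemble = reported_top_n[n]          # one try each from the top N models
    print(f"N={n}: {n}x top-1 = {repeated_top_1:.2%}  vs  1x top-{n} = {ensemble:.0%}")
```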
I can't read the numerical error bars on the top-1 model win rate; we could calculate a likelihood from them to see whether the deviation is statistically significant.
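For example, if each win rate were measured over, say, 100 tasks (a made-up number; the post doesn't say), a plain binomial test would be enough to check whether the 1x top-8 result really beats the independent-retries prediction:

```python
from scipy.stats import binomtest

# Hypothetical: assume each win rate was measured over 100 tasks (the real
# sample size isn't given in the post). Test whether the observed 1x top-8
# rate (95%) is consistent with the "independent retries" prediction (~88.9%).
n_tasks = 100                      # assumed, not from the post
observed_wins = 95                 # 95% of 100
predicted = 1 - 0.76 ** 8          # ~0.8887 under the null hypothesis

result = binomtest(observed_wins, n_tasks, p=predicted, alternative="two-sided")
print(f"p-value = {result.pvalue:.3f}")
```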
A quibble with this: you're not predicting it will be the best for whatever task you're throwing at it, you're predicting it will be sufficient.
For well-understood problems you can get adequate results out of a lot of models these days. Having to review n different outputs sounds like a step backwards for most tasks.
I do this sort of thing at the planning stage, though. Especially because there's not necessarily an obvious single "right" answer for a lot of questions, like how to break down a domain, or approaches to coordinating multiple processes. So if three different models suggest three different approaches, it helps me refine what I'm actually looking for in the solution. And that increases the hit rate for my "most models will do something sufficient" claim above.
We still code via interactive sessions with single agents when the stakes are lower (simple things, one off scripts, etc). But for more important stuff, we generally want the highest quality solution possible.
We also use this framework for brainstorming and planning. E.g. sometimes we ask them to write design docs, then compare and contrast. Or intentionally under-specify a task, see what the agents do, and use that to refine the spec before launching the real run.
One question I had - was the judgement blinded? Did judges know which models produced which output?
But that is a good point. Perhaps it should be mapped to something unidentifiable.
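Something like this would probably do. A rough sketch, where the model names and outputs are just placeholders:

```python
import random

# Rough sketch: strip model identities before judging by mapping each
# output to an anonymous label, keeping the mapping around to de-blind later.
outputs = {                     # placeholder model names and outputs
    "model-a": "diff A ...",
    "model-b": "diff B ...",
    "model-c": "diff C ...",
}

models = list(outputs)
random.shuffle(models)          # so label order leaks nothing about the model

blinded = {}                    # what the judge sees
key = {}                        # kept aside to de-blind after judging
for i, model in enumerate(models):
    label = f"candidate-{i + 1}"
    blinded[label] = outputs[model]
    key[label] = model

print(list(blinded))            # judge only ever sees candidate-1, candidate-2, ...
```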
Nonetheless you've convinced me to try an even wider variety of models, thanks!
In fact, this makes me think I should add this as a feature to my AI dev tooling - compare responses side by side and pick the best one.
By delegating to sub agents (eg for brainstorming or review), you can break out of local maxima while not using quite as many more tokens.
Additionally, when doing any sort of complex task, I do research -> plan -> implement -> review, clearing context after each stage. In that case, would I want to make 7x research docs, 7x plans, etc.? probably not. Instead, a more prudent use of tokens might be to have Claude do research+planning, and have Codex do a review of that plan prior to implementation.
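Roughly what I mean, as a sketch. `run_agent` is a hypothetical stand-in for whatever tooling you actually drive the models with, and the task is made up:

```python
# Sketch of the staged flow, clearing context between stages. `run_agent`
# is a hypothetical stand-in for however you invoke each model (CLI, API,
# etc.); each call starts fresh and only sees the artifacts passed in.
def run_agent(model: str, prompt: str) -> str:
    # Placeholder: swap in a real call to your agent tooling here.
    return f"[{model} output for: {prompt[:40]}...]"

task = "add rate limiting to the public API"  # made-up example task

research = run_agent("claude", f"Research the codebase for this task:\n{task}")
plan = run_agent("claude", f"Write an implementation plan.\nTask: {task}\nResearch:\n{research}")

# A second model reviews the plan before any code gets written.
review = run_agent("codex", f"Review this plan for gaps and risks:\n{plan}")

implementation = run_agent("claude", f"Implement the plan.\nPlan:\n{plan}\nReview notes:\n{review}")
print(implementation)
```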
The question is which multi-agent architecture, hierarchical or competitive, yields the best results under some task/time/cost constraints.
In general, our sense is that competitive is better when you want breadth and uncorrelated solutions. Or when the failure modes across agents are unknown (which is always, right now, but may not be true forever).
You are probably right, but my work pays for as many tokens as I want, which opens up a bunch of tactics that otherwise would be untenable.
I stick with sub-agent approaches outside of work for this reason, though, which is a more than fair point.
Edit: And this is why you should read the article before you post!
We run big ensembles because we're doing a lot of analysis of the system, etc.
And how does one compare the results in a way that's easy to parse? 7 models producing 1 PR each is one way, but it doesn't feel very easy to compare them like that.
For comparison, there's a `review` command that launches a sandboxed agent to review a given run and rank the various implementations. We usually run 1–3 review agents, pull the top 3 diffs, and do manual review from there.
We're working on better automation for this step right now.
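For example, if each review agent returns an ordered ranking of the implementations, the rankings can be merged with something as simple as a Borda count. A rough sketch (not what ships today; the implementation IDs are placeholders):

```python
from collections import defaultdict

# Rough sketch: merge rankings from several review agents with a Borda count.
# Each ranking is an ordered list of implementation IDs, best first.
rankings = [
    ["impl-3", "impl-1", "impl-7", "impl-2"],   # review agent 1
    ["impl-1", "impl-3", "impl-2", "impl-7"],   # review agent 2
    ["impl-3", "impl-7", "impl-1", "impl-2"],   # review agent 3
]

scores = defaultdict(int)
for ranking in rankings:
    for position, impl in enumerate(ranking):
        scores[impl] += len(ranking) - position   # higher score = ranked better

top_3 = sorted(scores, key=scores.get, reverse=True)[:3]
print("Pull these diffs for manual review:", top_3)
```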