The article doesn't report the reasoning/effort levels, so it can't be ruled out that the results differ simply because of different reasoning budgets.
I would also be interested in seeing coding performance on SWE benchmarks.
The headline result here: (Opus 4.8 + Opus 4.8) > Fable 5
It looks like "fusing" a model with itself gives almost as much gain as fusing two different models.
I've seen promising numbers for model fusion before: https://news.ycombinator.com/item?id=44630724
(In this case, a different approach: they randomized the LLM provider for every agentic turn. They found this helped a lot.)
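For anyone who wants to picture the per-turn randomization, here is a rough sketch. It is not the linked post's code: `run_agent`, `stub_a` and `stub_b` are hypothetical names, and the stubs stand in for real provider API calls. The only idea it shows is that every turn picks a provider at random while all providers share one conversation history.

```python
import random
from typing import Callable, Dict, List, Tuple

# A "provider" here is any callable that maps the shared history to a reply.
# Real code would wrap an API client; these are hypothetical stand-ins.
Provider = Callable[[List[str]], str]


def run_agent(
    providers: Dict[str, Provider],
    task: str,
    max_turns: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Run a fixed number of turns, choosing a random provider for each one."""
    history = [task]
    trace = []
    for _ in range(max_turns):
        # Sort the names so the choice is reproducible for a given seed.
        name = rng.choice(sorted(providers))
        reply = providers[name](history)
        # Every provider sees the same history, whoever wrote the earlier turns.
        history.append(reply)
        trace.append((name, reply))
    return trace


def stub_a(history: List[str]) -> str:
    return f"a saw {len(history)} messages"


def stub_b(history: List[str]) -> str:
    return f"b saw {len(history)} messages"


trace = run_agent({"a": stub_a, "b": stub_b}, "fix the bug", 4, random.Random(0))
```

Passing the same stub under two names gives the "model alloyed with itself" variant, which only makes sense if sampling is non-deterministic.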
But it's funny (and not too surprising) that just "alloying" a model with itself has a very similar effect. It's basically just more test-time compute, right? More reasoning time, with the benefit that the reasoning runs in parallel. Same cost, less time!
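To make the "parallel test-time compute" reading concrete, here is a minimal sketch. Big assumption: I'm using a plain majority vote over N parallel samples of the same model, which may not be what the article's fusion step actually does. `fuse_parallel` and `fake_sample` are hypothetical names, and `fake_sample` stands in for a real model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fuse_parallel(sample: Callable[[str, int], str], prompt: str, n: int) -> str:
    """Draw n answers from the same model concurrently and return the most common.

    Majority voting is an assumption made for illustration only.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map keeps the results in seed order, so the vote is reproducible.
        answers = list(pool.map(lambda seed: sample(prompt, seed), range(n)))
    return Counter(answers).most_common(1)[0][0]


def fake_sample(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for one model call: one of the samples goes wrong.
    return "41" if seed == 0 else "42"


answer = fuse_parallel(fake_sample, "What is 6 * 7?", 5)
```

The token bill is about n times a single sample, but the wall-clock time is roughly that of the slowest single sample, since the calls run side by side.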
I'd love to see more numbers on this, especially with the cheaper models. (For some models, caching is so good now that reprompting and forking are basically free.) Are the gains for tiny LLMs comparatively bigger or smaller? Etc.
I think this is the key takeaway here.