My understanding is that there was only 1 run per configuration?
If that's correct then, given the run-to-run variability, it really doesn't say much. It takes several trials per prompt per arm before the results start to stabilize on a plot. Doing that properly is prohibitively expensive, so I've been running the same prompt on the same model 5 times just to get a visual sense of performance.
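For the curious, the repeated-trials loop is roughly this (a minimal sketch assuming the Anthropic Python SDK; the model id and prompt are placeholders, not my exact setup):

```python
# Minimal sketch: run the same prompt N times against the same model and
# record output token counts. Model id and prompt are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = "Write a binary search in Rust."  # placeholder: same prompt every run
N_TRIALS = 5

output_tokens = []
for i in range(N_TRIALS):
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: same model every run
        max_tokens=8192,
        messages=[{"role": "user", "content": PROMPT}],
    )
    output_tokens.append(resp.usage.output_tokens)
    print(f"trial {i + 1}: {resp.usage.output_tokens} output tokens")

spread = max(output_tokens) - min(output_tokens)
print(f"min={min(output_tokens)} max={max(output_tokens)} spread={spread}")
```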
Someone did the same with lambda calculus. I wanted to make the point about how much run-to-run variability and cost difference you get with the same prompt on the same model across only 5 trials. I classified each of the thinking steps using Opus 4.6 (that classification alone costs ~$4 in tokens per run) and plotted them on custom flame graphs.
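The classification pass could look something like this (a hedged sketch: the label set and judge model id are illustrative assumptions, and the flame-graph rendering is omitted):

```python
# Sketch of labeling thinking steps with a judge model. LABELS and the
# judge model id are illustrative assumptions, not the original setup.
import anthropic

client = anthropic.Anthropic()

LABELS = ["planning", "exploration", "verification", "backtracking", "other"]

def classify_step(step_text: str) -> str:
    """Ask the judge model to assign exactly one label to one thinking step."""
    resp = client.messages.create(
        model="claude-opus-4-5",  # placeholder judge model id
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"Classify the reasoning step below as exactly one of: "
                f"{', '.join(LABELS)}. Reply with the label only.\n\n{step_text}"
            ),
        }],
    )
    label = resp.content[0].text.strip().lower()
    return label if label in LABELS else "other"

# Usage: label every step of one run's thinking trace.
steps = ["Let me restate the problem...", "Check: does the base case hold?"]
print([classify_step(s) for s in steps])
```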
When the run-to-run variability on the same prompt ranges from 8,163 to 17,334 tokens, a spread of more than 2x, none of these tests mean that much.
It is the same idiocy that permeates EV cars. You buy an expensive car to get from A to B while also giving you comfort. If I have to think about whether or not to use the seat heating, I'm out of my comfort zone. So no, fuck caveman, and I don't fucking care about the burned tokens.
Be brief. It's easy, needs no setup, and it isn't yet another mindless mumbo-jumbo extension with its 325 dependencies.