You mention that Opus 4.6 cost $1200 in one match. How do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
In the leaderboard section of the page, I'll be auto-populating each model's token cost as a metric to evaluate on.
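To make the trade-off in the question concrete, here is a rough sketch of one way token cost could be combined with playing strength into a single number. This is not the leaderboard's actual metric. The `Entry` class, the model names, and the relative `score` scale (1.0 = top model) are all hypothetical. The figures come from the question's example: $1200 per match for the top model, and a model that plays 80% as well at 1% of the cost.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One hypothetical leaderboard row (names and fields are assumptions)."""
    name: str
    score: float     # relative playing strength, 1.0 = top model (assumed scale)
    cost_usd: float  # token cost of a single match, in US dollars

    @property
    def score_per_dollar(self) -> float:
        # Simple efficiency ratio: playing strength bought per dollar spent.
        return self.score / self.cost_usd


entries = [
    Entry("top-model", score=1.0, cost_usd=1200.0),
    Entry("cheap-model", score=0.8, cost_usd=12.0),  # 80% as good at 1% of the cost
]

# Rank by efficiency instead of raw score.
ranked = sorted(entries, key=lambda e: e.score_per_dollar, reverse=True)

for e in ranked:
    print(f"{e.name}: score={e.score:.2f} cost=${e.cost_usd:.2f} "
          f"score/$={e.score_per_dollar:.5f}")
```

Under these assumed numbers the cheaper model delivers roughly 80 times more score per dollar. A plain ratio like this rewards very cheap, weak models as well, so it probably works best shown next to the raw score rather than replacing it.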