The idea:
Tokens per dollar
Weighted input/output pricing (assuming a 75/25 input/output token split)
Benchmark-normalized quality (Arena, Aider, SWE-bench)
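
To make the three pieces above concrete, here is a minimal sketch of how they might combine. It assumes the 75/25 weighting means 75% input tokens and 25% output tokens, that prices are quoted per million tokens, and that quality has already been normalized to a 0-1 score. The function names and the example prices are hypothetical, not taken from any real price list.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_share: float = 0.75) -> float:
    """Blend per-million-token prices into one figure.

    Assumption: input_share is the fraction of all tokens that are input
    tokens (0.75 here mirrors the 75/25 assumption in the post).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_price + (1.0 - input_share) * output_price


def quality_adjusted_tokens_per_dollar(input_price: float, output_price: float,
                                       quality: float,
                                       input_share: float = 0.75) -> float:
    """Tokens per dollar scaled by a normalized quality score in [0, 1]."""
    price = blended_price_per_million(input_price, output_price, input_share)
    if price <= 0:
        raise ValueError("blended price must be positive")
    tokens_per_dollar = 1_000_000 / price
    return quality * tokens_per_dollar


# Hypothetical model: $2 per million input tokens, $10 per million output
# tokens, normalized quality 0.8.
blended = blended_price_per_million(2.0, 10.0)
score = quality_adjusted_tokens_per_dollar(2.0, 10.0, quality=0.8)
print(f"blended price: ${blended:.2f} per million tokens")
print(f"quality-adjusted tokens per dollar: {score:,.0f}")
```

Multiplying quality straight into tokens per dollar treats quality as linear in value, which is itself a modeling choice and one of the things worth critiquing.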
Early results surprised me: local models often lose on cost unless privacy is heavily valued.
I’m mostly looking for critique of the methodology:
Is quality-adjusted tokens per dollar even the right metric?
Is normalizing Elo to a percentage defensible?
What benchmarks am I missing?
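
For context on the Elo question, here is a rough sketch of two ways a rating could be turned into a 0-1 or percentage score. Neither is necessarily what I should be using. Min-max scaling depends entirely on which models are in the pool, while the standard Elo expected-score formula (base 10, scale 400) gives a win probability against a chosen reference rating. The function names, model names, and ratings below are made up for illustration, and whether a given leaderboard's ratings follow the 400-point convention is an assumption to check.

```python
def min_max_percent(ratings: dict[str, float]) -> dict[str, float]:
    """Rescale ratings so the lowest maps to 0% and the highest to 100%.

    Caveat: the result changes whenever a model is added to or removed
    from the pool, so percentages are not comparable across pools.
    """
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        return {name: 100.0 for name in ratings}
    return {name: 100.0 * (r - lo) / (hi - lo) for name, r in ratings.items()}


def expected_score(rating: float, reference: float) -> float:
    """Standard Elo expected score of `rating` against `reference`.

    Returns a value in (0, 1): the modeled win probability, assuming the
    usual base-10, 400-point Elo scale.
    """
    return 1.0 / (1.0 + 10 ** ((reference - rating) / 400.0))


# Hypothetical ratings, not real leaderboard numbers.
pool = {"model_a": 1000.0, "model_b": 1200.0, "model_c": 1100.0}
print(min_max_percent(pool))
print({name: round(expected_score(r, reference=1200.0), 3)
       for name, r in pool.items()})
```

The two give very different spreads for the same ratings, which is part of why I'm unsure the normalization is defensible.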