Whether model routing works is an empirical problem. Existing empirical efforts rely on benchmark accuracy retention, i.e. how does a model routing system score compared to a sophisticated model like Opus 4.7 on a complex task benchmark like Terminal-Bench 2.0.
However, that metric is completely divorced from what we care about. The better metric is utility retention, which takes into account task importance.