In practice, how are you measuring the +10pp gain? Are you using fixed eval sets or something more dynamic?
I’ve seen small models look better on benchmarks but regress pretty quickly once prompts/tools change slightly, so I’m curious how stable these gains are over time.
You're right about stability. Haven't run rotated/rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
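Roughly what I have in mind for the rotation check is something like the sketch below. This is illustrative only: `run_model`, `paraphrase_variants`, and `EVAL_SET` are placeholder stand-ins for whatever the real harness uses, and the sample case is made up.

```python
import statistics

# --- Placeholder hooks: swap in the real harness pieces. ---
def run_model(prompt: str) -> str:
    """Stub that always returns the same answer; replace with the real model call."""
    return "config/retry.yaml"

def paraphrase_variants(prompt: str, n: int = 3) -> list[str]:
    """Cheap templated rewrites; in practice these would be manual or LLM-generated."""
    prefixes = ("In this repo,", "Quick question:", "Please tell me:")
    return [f"{p} {prompt}" for p in prefixes][:n]

# Each case: original prompt + expected answer (illustrative example only).
EVAL_SET = [
    {"prompt": "Where is the retry config defined?", "answer": "config/retry.yaml"},
]

def accuracy(cases) -> float:
    correct = sum(run_model(c["prompt"]).strip() == c["answer"] for c in cases)
    return correct / len(cases)

# Score the fixed set once, then score each rephrased rotation.
baseline = accuracy(EVAL_SET)
rotations = []
for i in range(3):
    rotated = [
        {"prompt": paraphrase_variants(c["prompt"])[i % 3], "answer": c["answer"]}
        for c in EVAL_SET
    ]
    rotations.append(accuracy(rotated))

# A stable gain should survive rephrasing: the spread across rotations
# matters as much as the mean.
print(f"fixed set: {baseline:.1%}")
print(f"rotated mean: {statistics.mean(rotations):.1%}, "
      f"stdev: {statistics.stdev(rotations):.1%}")
```

If the rotated mean stays near the fixed-set number with low spread, the +10pp is probably real rather than an artifact of the particular phrasings in the eval set.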