In practice, how are you measuring the +10pp gain? Are you using fixed eval sets or something more dynamic?
I’ve seen small models look better on benchmarks but regress pretty quickly once prompts/tools change slightly, so I’m curious how stable these gains are over time.
You're right about stability. Haven't run rotated/rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
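Roughly what I have in mind for the rotation check is something like the sketch below. This is illustrative only: `run_model`, `paraphrase_variants`, and `EVAL_SET` are placeholder stand-ins for whatever the real harness uses, and the sample case is made up.

```python
import statistics

# --- Placeholder hooks: swap in the real harness pieces. ---
def run_model(prompt: str) -> str:
    """Stub that always returns the same answer; replace with the real model call."""
    return "config/retry.yaml"

def paraphrase_variants(prompt: str, n: int = 3) -> list[str]:
    """Cheap templated rewrites; in practice these would be manual or LLM-generated."""
    prefixes = ("In this repo,", "Quick question:", "Please tell me:")
    return [f"{p} {prompt}" for p in prefixes][:n]

# Each case: original prompt + expected answer (illustrative example only).
EVAL_SET = [
    {"prompt": "Where is the retry config defined?", "answer": "config/retry.yaml"},
]

def accuracy(cases) -> float:
    correct = sum(run_model(c["prompt"]).strip() == c["answer"] for c in cases)
    return correct / len(cases)

# Score the fixed set once, then score each rephrased rotation.
baseline = accuracy(EVAL_SET)
rotations = []
for i in range(3):
    rotated = [
        {"prompt": paraphrase_variants(c["prompt"])[i % 3], "answer": c["answer"]}
        for c in EVAL_SET
    ]
    rotations.append(accuracy(rotated))

# A stable gain should survive rephrasing: the spread across rotations
# matters as much as the mean.
print(f"fixed set: {baseline:.1%}")
print(f"rotated mean: {statistics.mean(rotations):.1%}, "
      f"stdev: {statistics.stdev(rotations):.1%}")
```

If the rotated mean stays near the fixed-set number with low spread, the +10pp is probably real rather than an artifact of the particular phrasings in the eval set.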