2 points by pranabsarkar 2 hours ago | 3 comments
  • aayushkumar121 2 hours ago
    Interesting idea, especially the “navigation vs selection” framing.

    In practice, how are you measuring the +10pp gain? Are you using fixed eval sets or something more dynamic?

    I’ve seen small models look better on benchmarks but regress pretty quickly once prompts/tools change slightly, so curious how stable these gains are over time.

  • pranabsarkar 2 hours ago
      Fixed eval — 80 tools, 200 queries, 4 model sizes. +10pp came from "all tools" vs "tiered" on 1.5B.

      You're right about stability. Haven't run rotated/rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
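To make the setup concrete, here is a minimal sketch of what a fixed eval like the one described above (80 tools, 200 queries, "all tools" vs "tiered" exposure) could look like. All names (`select_tool`, `TOOL_BUCKET_SIZE`, the simulated skill rate) are illustrative stand-ins, not the author's actual harness.

```python
# Hypothetical fixed-eval harness. select_tool() fakes a model call:
# it picks the expected tool with probability `skill` when that tool
# is visible, otherwise guesses among the visible tools.
import random

random.seed(0)

TOOLS = [f"tool_{i}" for i in range(80)]                       # 80 tools
QUERIES = [(f"query_{i}", TOOLS[i % len(TOOLS)])               # 200 queries,
           for i in range(200)]                                # each with its expected tool
TOOL_BUCKET_SIZE = 10                                          # tier size (assumed)

def select_tool(expected, visible_tools, skill=0.8):
    if expected in visible_tools and random.random() < skill:
        return expected
    return random.choice(visible_tools)

def accuracy(condition, skill=0.8):
    hits = 0
    for _query, expected in QUERIES:
        if condition == "all_tools":
            visible = TOOLS
        else:  # "tiered": only a small bucket is exposed, which may miss the tool
            visible = random.sample(TOOLS, TOOL_BUCKET_SIZE)
        if select_tool(expected, visible, skill) == expected:
            hits += 1
    return hits / len(QUERIES)

print(accuracy("all_tools"), accuracy("tiered"))
```

The point of the sketch is the comparison structure, not the numbers: the "tiered" condition can only lose accuracy when the routing step hides the right tool, which is the navigation-vs-selection distinction being measured.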

  • gbibas 2 hours ago
    This is cool. I’m working on something similar for AI code-schema reads, which were costing me a lot of tokens. I’ll share once it’s battle-tested. Abstracting first and then giving the model a tree to follow is where I landed too.
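A minimal sketch of the "give it a tree to follow" idea: a two-level index the model navigates (pick a branch, then see only that branch's leaves), so the context holds a handful of names instead of the full flat list. The tree contents and `tools_for` helper are hypothetical.

```python
# Hypothetical two-level tool index for navigation-style selection.
TOOL_TREE = {
    "files": ["read_file", "write_file", "list_dir"],
    "search": ["grep", "glob"],
    "web": ["fetch_url"],
}

def tools_for(category):
    """Return only the leaf tools under one branch of the tree."""
    return TOOL_TREE.get(category, [])

print(tools_for("search"))  # → ['grep', 'glob']
```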