Kalibr pretty much freed me from that loop.
I set up GPT-4 and Claude as two routes, defined success as accurate citations I can actually verify, and now it just works.
Last week, GPT-4 started getting oddly slow on longer papers, and by the time I noticed, traffic had already been shifted to Claude automatically.
It's the difference between babysitting an agent and having a tool that keeps working without constant supervision.
Honestly, I wish I had discovered this a few months ago hehe
Most teams I work with hardcode a single “golden path” for agents, then rely on dashboards, alerts, and tribal knowledge to notice when behavior degrades. By the time someone debugs model choice, tool params, or prompt drift, the environment has already changed again. The feedback loop is slow and brittle.
What’s interesting here is the explicit shift from observability to outcome-driven control. Routing based on actual production success rather than static benchmarks or offline evals aligns with how reliability engineering evolved in other domains. We moved from “what happened?” to “what should the system do next?” years ago.
A couple of questions I’m curious about:
- How do you define and normalize “success” across heterogeneous tasks without overfitting to short-term signals?
- How do you prevent oscillation or path thrashing when outcomes are noisy or sparse?
- Is there a notion of confidence or regret baked into the routing decisions over time?
Overall, this feels less like a router and more like an autonomous control plane for agents. If it holds up under real-world variance, this is a meaningful step toward agents that are self-healing rather than constantly babysat.
Defining success: We don't normalize it. Teams define their own outcome signals (latency, cost, user ratings, task completion, etc.). You don't need perfect attribution to beat static configs; even noisy signals surface real patterns when aggregated correctly.
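As a rough illustration of how lightweight an outcome signal can be (the function name and fields below are hypothetical, not Kalibr's actual API):

```python
# Hypothetical outcome reporting; names and fields are illustrative only.
def report_outcome(path_id, success, metadata=None):
    """Record whether a routed call achieved the team-defined goal,
    e.g. "citation verified" or "meeting booked". The router aggregates
    these noisy per-call signals; perfect attribution isn't required."""
    outcome = {"path": path_id, "success": success, "meta": metadata or {}}
    print(outcome)  # stand-in for sending the signal to the routing control plane

# A binary "did it work?" signal from a downstream check:
report_outcome("gpt-4", success=True, metadata={"latency_ms": 2100})
```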
Oscillation: Thompson Sampling. Instead of greedily chasing the current best path, we maintain uncertainty estimates and explore proportionally. Sparse or noisy outcomes widen confidence intervals, which naturally dampens switching. Wilson scoring handles the low-sample edge cases without the wild swings you'd get from raw percentages.
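To make that concrete, here's a minimal sketch of how Beta-posterior Thompson Sampling and a Wilson lower bound can sit side by side (illustrative Python, not our production code; the Beta(1,1) prior and class layout are assumptions):

```python
import math
import random

class PathStats:
    """Outcome counts for one route/model path (illustrative, not Kalibr's schema)."""
    def __init__(self, name):
        self.name = name
        self.successes = 0
        self.failures = 0

    def sample_success_rate(self):
        # Thompson Sampling: draw a plausible success rate from the Beta posterior
        # (uniform Beta(1,1) prior assumed). Few observations -> wide posterior ->
        # more exploration and less thrashing on noisy point estimates.
        return random.betavariate(self.successes + 1, self.failures + 1)

    def wilson_lower_bound(self, z=1.96):
        # Wilson score lower bound: a conservative estimate that avoids the wild
        # swings of raw percentages at low sample counts (1/1 is not "100%").
        n = self.successes + self.failures
        if n == 0:
            return 0.0
        p = self.successes / n
        centre = p + z * z / (2 * n)
        margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return (centre - margin) / (1 + z * z / n)

def choose_path(paths):
    # Route to the path with the highest sampled rate; the randomness of the
    # posterior draw is what keeps exploration alive instead of greedy lock-in.
    return max(paths, key=lambda s: s.sample_success_rate())
```

After each call you'd increment `successes` or `failures` on the chosen path; as evidence accumulates the posteriors tighten and switching settles down on its own. The Wilson bound is shown here as a conservative low-sample estimate; exactly how it folds into the routing decision is an assumption left out of the sketch.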
Confidence/regret: Explicit in the routing math. Every path carries uncertainty that decays with evidence. The system minimizes cumulative regret over time rather than optimizing point-in-time decisions.
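Continuing the sketch above, cumulative regret is just the running gap between the best path's true success rate and the rate of the path actually chosen. A toy simulation (true rates made up for illustration) shows why this grows sublinearly as the posteriors concentrate:

```python
def simulate(true_rates, steps=5000, seed=0):
    # Toy pseudo-regret simulation reusing PathStats from the sketch above.
    rng = random.Random(seed)
    paths = [PathStats(f"path-{i}") for i in range(len(true_rates))]
    best = max(true_rates)
    cumulative_regret = 0.0
    for _ in range(steps):
        # Thompson draw picks the path; the (hidden) true rate decides the outcome.
        idx = max(range(len(paths)), key=lambda i: paths[i].sample_success_rate())
        if rng.random() < true_rates[idx]:
            paths[idx].successes += 1
        else:
            paths[idx].failures += 1
        cumulative_regret += best - true_rates[idx]
    return cumulative_regret

# Most of the regret is paid early, while uncertainty is high; once the better
# path's posterior dominates, the per-step regret approaches zero.
print(simulate([0.90, 0.75]))
```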
The gap we're closing is exactly what you mentioned. Self-correcting instead of babysat.
The Thompson Sampling + Wilson score combo is a pragmatic choice. In practice, most agent systems I see fail not because they lack metrics, but because they overreact to them. Noisy reward signals plus greedy selection is how teams end up whipsawing configs or freezing change altogether. Treating uncertainty as a first-class input instead of something to smooth away is the right move.
I also agree with your point on attribution. Perfect attribution is a trap. In real production environments, partial and imperfect outcome signals still dominate static configs if the system can reason probabilistically over time. This mirrors what we learned in reliability and delivery metrics years ago: trend dominance beats point accuracy.
One area I’d be curious about as this matures is organizational adoption rather than the math:
- How teams reason about defining outcomes without turning it into a governance bottleneck
- How you help users build intuition around uncertainty and regret so they trust the system when it routes “away” from what feels intuitively right
- Where humans still need to intervene, if anywhere, once the control plane is established
If this holds up across long-tail tasks and low-frequency failures, it feels like a real step toward agents that behave more like adaptive systems and less like fragile workflows with LLMs bolted on.
Appreciate the thoughtful reply.
Outcome definition: Simpler is better. Teams that start with one binary signal like "did it work?" (call completed, meeting booked, etc.) get value immediately. Governance bottlenecks usually come from overthinking it upfront.
Building trust: When Kalibr routes away from what feels like the "right" model and it works, people are surprised. We capture and show outcome history so teams can see when a path started to degrade and when Kalibr shifted traffic. No LLM decision-making means no black box around routing choices; it's all shown in your dashboard when you use Kalibr.
Human intervention: Defining new paths, adding goals, and handling edge cases where signal is genuinely sparse. The goal isn't zero humans anywhere; it's getting them out of the reactive debugging loop so they can focus on strategic decisions instead of repeatedly patching failed agents.
Curious: have you built multi-step agents and run into the challenge of repeated failures?