It's hard to answer this, since the right thing to do depends a lot on the environment.
My pitch would be that automation is not a new problem and the fundamentals still apply:
- We can’t compare what we can’t measure
- I need to ask myself, can I trust this to run on its own?
- I need to be able to describe what I want
- I need to understand my risk profile
Once you're there, you can work evals-first in a TDD-ish style and keep control throughout the journey: not over-investing, not losing your grip on the system, and letting people join in by understanding those fundamentals, regardless of whether or not they come from a very solid software engineering background.
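As a minimal sketch of the TDD-ish idea, assuming pytest and a hypothetical `run_agent` function wrapping whatever system you're automating: expectations get written down as executable cases before the automation is trusted.

```python
# Hypothetical "evals as tests" sketch: run_agent is a stand-in for your
# own system under test, not a real library call.
import pytest

from my_agent import run_agent  # hypothetical module wrapping your agent

CASES = [
    # (input prompt, a bounded property the output must satisfy)
    ("Summarize our refund policy", lambda out: "refund" in out.lower()),
    ("Translate 'hello world' to Spanish", lambda out: "hola" in out.lower()),
]

@pytest.mark.parametrize("prompt,check", CASES)
def test_agent_meets_expectation(prompt, check):
    output = run_agent(prompt)
    # Assert a property of the output rather than an exact string,
    # because LLM output varies from run to run.
    assert check(output), f"unexpected output for {prompt!r}: {output!r}"
```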
To be specific: for cross-functional teams with > 90% non software engineers, we invested in programmatic-first tools (think the claude/codex/continue.dev CLIs) over "hype" UI solutions, because we knew integration would come through MCPs (and then skills). We designed things so that the agents/CLIs/black boxes stayed decoupled and no particular change or resource limitation could destroy our ability to continue (a sketch of that decoupling follows below). We also focused a lot on not having to maintain whatever we build (ZeroOps) and on not having to worry [1] about manual monitoring.
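One possible shape for that `run_agent` black box, as a hedged sketch: the pipeline only ever calls one thin function, and the actual CLI behind it is a config value, so swapping claude for codex (or surviving a vendor change) is a one-line edit. The default command name and the `-p` flag are assumptions; substitute whatever your chosen CLI actually accepts.

```python
# Decoupling sketch: nothing outside this function knows which agent CLI
# is in use. AGENT_CMD and the '-p' prompt flag are assumptions.
import os
import subprocess

AGENT_CMD = os.environ.get("AGENT_CMD", "claude")  # swappable black box

def run_agent(prompt: str, timeout: int = 120) -> str:
    """Send a prompt to whichever agent CLI is configured; return its stdout."""
    result = subprocess.run(
        [AGENT_CMD, "-p", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the CLI exits non-zero
    )
    return result.stdout
```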
For evals, I suggest you explore two different tools, such as promptfoo (technically no programming required to author an eval, just YAML) and deepeval or pydantic-evals (Python based), and see how you feel about either given your context and team composition needs. Once you start, some things will click, such as the value of bounding non-determinism [2] as much as possible, and you might even end up with proof for those intuitions. Other challenges will come later (cost, system tests, sandboxing and more); a minimal harness for the non-determinism part is sketched below.
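A library-free sketch of bounding non-determinism [2], reusing the hypothetical `run_agent` from above: instead of expecting one lucky run to match exactly, run each case several times and gate on a pass rate.

```python
# Pass-rate gating sketch: repeat each case and require a threshold,
# which bounds (rather than ignores) the non-determinism of the system.
def pass_rate(prompt: str, check, runs: int = 5) -> float:
    """Fraction of runs whose output satisfies `check`."""
    passes = sum(1 for _ in range(runs) if check(run_agent(prompt)))
    return passes / runs

rate = pass_rate("Summarize our refund policy",
                 lambda out: "refund" in out.lower())
assert rate >= 0.8, f"pass rate {rate:.0%} is below the 80% bar"
```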
- [1] https://alexhans.github.io/posts/series/zeroops/no-news-is-g...
- [2] https://alexhans.github.io/posts/series/evals/error-compound...