  • alexhans 6 days ago
    > Curious what people are doing in the real world

    It's hard to answer this, as the right thing to do depends a lot on the environment.

    My pitch would be that automation is not a new problem and the fundamentals still apply:

    - We can’t compare what we can’t measure

    - I need to ask myself: can I trust this to run on its own?

    - I need to be able to describe what I want

    - I need to understand my risk profile

    Once you're there, you can work evals-first in a TDD-ish style and keep control throughout the journey: not over-investing, not losing control, and letting people join in once they understand those fundamentals, whether or not they come from a solid software engineering background.
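
    To make "evals-first, TDD-ish" concrete, here's a minimal sketch in plain Python with no framework; the cases and the agent below are illustrative stand-ins, not anything from my actual setup:

    ```python
    # A minimal evals-first sketch (plain Python, no framework; the cases
    # and the agent are illustrative stand-ins). As in TDD, the cases
    # exist before the automation is trusted.
    from typing import Callable

    # Each case: (prompt, predicate the output must satisfy).
    CASES: list[tuple[str, Callable[[str], bool]]] = [
        ("Extract the due date from: 'Payable by 2024-03-01'",
         lambda out: "2024-03-01" in out),
        ("Summarise: 'Invoice total is $120.'",
         lambda out: "120" in out),
    ]

    def evaluate(run_agent: Callable[[str], str]) -> float:
        """Return the pass rate of `run_agent` over CASES."""
        passed = sum(check(run_agent(prompt)) for prompt, check in CASES)
        return passed / len(CASES)

    if __name__ == "__main__":
        # Wire in your real agent/CLI here; the echo stub only shows the shape.
        print(f"pass rate: {evaluate(lambda prompt: prompt):.0%}")
    ```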

    To be specific: for cross-functional teams that are > 90% non-software-engineers, we invested in programmatic-first tools (think the claude/codex/continue.dev CLIs) over "hype" UI solutions, because we knew integration would come through MCPs (and then skills). We designed things so that the agents/CLIs/black boxes were decoupled, and no particular change or resource limitation could destroy our ability to continue (see the sketch below). We also focused heavily on not having to maintain whatever we build (ZeroOps) and on not having to worry [1] about manual monitoring.
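
    The decoupling point fits in a few lines; the `Agent` protocol and `CliAgent` wrapper below are hypothetical names rather than our actual design, but they show why losing any single CLI wouldn't hurt:

    ```python
    # Sketch of the decoupling idea (names are illustrative): business
    # logic depends on a tiny interface, so any one agent/CLI can be
    # swapped out without rewriting anything downstream.
    import subprocess
    from typing import Protocol

    class Agent(Protocol):
        def run(self, prompt: str) -> str: ...

    class CliAgent:
        """Wraps any CLI that reads a prompt on stdin and prints a result."""
        def __init__(self, command: list[str]):
            self.command = command

        def run(self, prompt: str) -> str:
            result = subprocess.run(
                self.command, input=prompt,
                capture_output=True, text=True, check=True,
            )
            return result.stdout

    def summarize(agent: Agent, text: str) -> str:
        # Callers only ever see the Agent interface, never a vendor.
        return agent.run(f"Summarise:\n{text}")

    # Swapping vendors is then a one-line change at the composition root:
    # agent = CliAgent(["your-agent-cli"])  # whichever CLI you standardise on
    ```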

    For evals, I suggest you explore two different tools, such as promptfoo (technically no programming required to create an eval, just YAML) and deepeval or pydantic-evals (Python-based), and see how you feel about each given your context and team composition. Once you start, some things will begin to click, such as the value of bounding non-determinism [2] as much as possible, and you might even end up with proof of that value. Other challenges will come later (cost, system tests, sandboxing, and more).
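
    One way to make "bounding non-determinism" concrete, independent of any of those tools' syntax (a plain-Python sketch; the stub agent is hypothetical): sample each case several times and gate on a pass rate rather than a single run.

    ```python
    # Bounding non-determinism, sketched without a framework: each case is
    # sampled several times and must clear a pass-rate threshold, so one
    # lucky (or unlucky) sample can't decide the result on its own.
    from typing import Callable

    def passes_reliably(
        run_agent: Callable[[str], str],
        prompt: str,
        check: Callable[[str], bool],
        trials: int = 5,
        threshold: float = 0.8,
    ) -> bool:
        """True if `check` holds on at least `threshold` of `trials` runs."""
        hits = sum(check(run_agent(prompt)) for _ in range(trials))
        return hits / trials >= threshold

    if __name__ == "__main__":
        # Stub agent just to show the shape; replace with a real call.
        ok = passes_reliably(
            run_agent=lambda p: "42",
            prompt="What is 6 * 7? Answer with the number only.",
            check=lambda out: out.strip() == "42",
        )
        print("reliable" if ok else "flaky")
    ```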

    - [1] https://alexhans.github.io/posts/series/zeroops/no-news-is-g...

    - [2] https://alexhans.github.io/posts/series/evals/error-compound...