2 points by yelmahallawy 5 hours ago | 2 comments
  • alexhans 2 hours ago
Very, very heterogeneous and fast-moving space.

    Depending on how they're made up, different teams do vastly different things.

Some have no evals at all, some run integration tests with no tooling, some wire observability tools like Langfuse into their CI/CD. Others use tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, or Pydantic AI throughout their development.

    It's definitely an afterthought for most teams although we are starting to see increased interest.

My hope is that we can start treating evals as a common language for "product" across role families, so I'm doing some advocacy [1], trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" are still hard concepts, but the UX is getting better with time.

What are you doing for yourself/your teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.
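    "Just starting" can be as small as a hand-rolled loop over a few prompt/expectation pairs. A minimal sketch, where `call_model` is a placeholder for whatever model client you actually use (the canned answers are purely illustrative):

    ```python
    # Minimal eval loop: run each case, check the expected substring
    # appears in the output, and report a pass count.
    def call_model(prompt: str) -> str:
        # Placeholder model client: returns canned answers for this demo.
        canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
        return canned.get(prompt, "")

    EVAL_SET = [
        {"prompt": "What is 2+2?", "expect": "4"},
        {"prompt": "Capital of France?", "expect": "Paris"},
    ]

    def run_evals(eval_set):
        results = []
        for case in eval_set:
            output = call_model(case["prompt"])
            results.append({"prompt": case["prompt"],
                            "passed": case["expect"] in output})
        passed = sum(r["passed"] for r in results)
        print(f"{passed}/{len(results)} passed")
        return results

    results = run_evals(EVAL_SET)
    ```

    Once a loop like this exists, swapping the substring check for an LLM-as-judge or a tool like promptfoo is incremental rather than a rewrite.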

    - [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)

  • veloryn 4 hours ago
    A lot of teams still seem to rely on ad-hoc eval sets and manual spot checks, especially for domain-specific use cases. The harder problem starts when agents or tool use enter the picture: the evaluation surface expands beyond model output quality to things like tool-selection reliability, reasoning loops, cost stability, and cascading failure modes across steps. At that point you're effectively evaluating system behavior rather than just model accuracy.
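    Two of those system-level checks can be sketched directly over an agent trace. This is a hedged illustration: the trace format and `expected_tool` labels are assumptions for the example, not any particular framework's schema.

    ```python
    # Score an agent trace for tool-selection reliability, and flag
    # repeated tool calls as a crude proxy for a reasoning loop.
    from collections import Counter

    trace = [
        {"step": 1, "tool": "search",     "expected_tool": "search"},
        {"step": 2, "tool": "calculator", "expected_tool": "calculator"},
        {"step": 3, "tool": "search",     "expected_tool": "code_exec"},
    ]

    def tool_selection_accuracy(trace):
        # Fraction of steps where the agent picked the labeled tool.
        correct = sum(s["tool"] == s["expected_tool"] for s in trace)
        return correct / len(trace)

    def detect_loops(trace, max_repeats=2):
        # Flag any tool invoked more than `max_repeats` times.
        counts = Counter(s["tool"] for s in trace)
        return [tool for tool, n in counts.items() if n > max_repeats]

    acc = tool_selection_accuracy(trace)   # 2 of 3 steps correct here
    loops = detect_loops(trace)            # no tool exceeds 2 calls here
    ```

    Per-step metrics like these are what distinguish agent evals from single-shot output grading: a run can produce a correct final answer while still failing on tool choice or cost.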