Depending on their makeup, different teams do vastly different things.
Some have no evals at all, some run integration tests with no dedicated tooling, and some use a mix of observability tools like Langfuse in their CI/CD. Others use tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, or Pydantic AI throughout their development.
Evals are definitely an afterthought for most teams, although we are starting to see increased interest.
My hope is that we can start thinking about evals as a common language for "product" across role families, so I'm doing some advocacy [1] and trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" are still quite hard concepts, but the UX is getting better with time.
What are you doing for yourself or your teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.
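To make "just start" concrete, here is a minimal sketch of an eval with no tooling at all, just plain Python. Everything in it is made up for illustration: `fake_model` is a hypothetical stand-in for whatever actually calls your LLM or agent, and `EvalCase`, `run_evals`, and the substring check are my own assumptions, not the API of any tool mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # simplest possible check: expected substring


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model or agent call.
    canned = {
        "What is 2 + 2?": "2 + 2 is 4.",
        "Capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "I don't know.")


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict:
    # Run every case and record which prompts failed the substring check.
    failures = [
        case.prompt
        for case in cases
        if case.must_contain.lower() not in model(case.prompt).lower()
    ]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "Paris"),
    EvalCase("Who wrote Hamlet?", "Shakespeare"),
]

if __name__ == "__main__":
    report = run_evals(fake_model, CASES)
    print(f"{report['passed']}/{report['total']} passed")
    for prompt in report["failures"]:
        print("FAILED:", prompt)
```

Swapping `fake_model` for a real call and adding cases as you hit bugs is usually enough to show where the friction is, and whether a dedicated tool would help.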
- [1] https://ai-evals.io/ (practical examples: https://github.com/Alexhans/eval-ception)