We tried eval platforms, LLM-as-judge, and automated prompt optimizers. None helped with what actually mattered: hidden domain policies that weren’t explicitly written anywhere.
We ended up building our own annotation UI, prompt integration workflow (via Claude Code SDK), and HTML diff-based experiment reports.
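To give a flavor of the reports (not our actual code): a minimal sketch using Python's stdlib difflib.HtmlDiff to render a side-by-side HTML diff of two experiment runs. The file names and function are hypothetical placeholders.

```python
import difflib
from pathlib import Path

def write_experiment_report(baseline_path: str, candidate_path: str, out_path: str) -> None:
    """Render a side-by-side HTML diff of two experiment output files."""
    baseline = Path(baseline_path).read_text().splitlines()
    candidate = Path(candidate_path).read_text().splitlines()
    # HtmlDiff produces a complete, self-contained HTML page.
    html = difflib.HtmlDiff(wrapcolumn=80).make_file(
        baseline, candidate,
        fromdesc="baseline prompt", todesc="candidate prompt",
    )
    Path(out_path).write_text(html)

write_experiment_report("baseline_outputs.txt", "candidate_outputs.txt", "report.html")
```

The real version layers domain-specific annotations on top, but the diff-per-experiment structure is the core idea.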
The biggest lesson: off-the-shelf eval/annotation/prompt-optimization tools are sub-par because they can only be generic.
Curious whether others building AI products have reached the same conclusion.