In general, we've struggled with how much the LLM eval/observability space centers on pre-baked scorers like ROUGE/BLEU, or off-the-shelf LLM-as-a-judge setups. I think it's fair to say the current crop of frontier models has mostly “solved” basic summarization and QA over a fixed set of docs. Measuring word overlap between a generated summary and the source text doesn't tell us much about whether a response actually helped solve a real problem.
In practice, the root failures we see look different:

- pulling the wrong documents on the retrieval side
- partially following formatting instructions
- drifting across multi-turn interactions
- responses that look reasonable but lead to user follow-ups because something was missing or off
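For the formatting-instruction bucket in particular, simple deterministic checks tend to catch failures that overlap metrics miss. As a minimal sketch (the prompt rules and function name here are made up for illustration, not from the post): suppose the prompt asked for exactly three numbered bullets, each under 200 characters.

```python
import re

def check_format_compliance(response: str) -> dict:
    # Hypothetical rule-based check for a prompt that asked for
    # exactly three numbered bullets, each under 200 characters.
    bullets = re.findall(r"^\d+\.\s+(.*)$", response, flags=re.MULTILINE)
    return {
        "has_three_bullets": len(bullets) == 3,
        "bullets_under_limit": all(len(b) <= 200 for b in bullets),
    }

good = "1. First point\n2. Second point\n3. Third point"
bad = "Here are my thoughts in a paragraph instead."

print(check_format_compliance(good))  # both checks pass
print(check_format_compliance(bad))   # has_three_bullets is False
```

Checks like these are cheap to run on every trace, so they make a decent first tier before any LLM-as-a-judge step.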
So hopefully the post is a helpful guide to some of the more sustainable eval strategies that have worked well for us :)