3 points by choffer 6 hours ago | 1 comment
  • choffer 5 hours ago
    A bit of a rant dressed up as an AI engineering post on LLM eval strategies.

    In general, we've struggled with how much the LLM eval/observability space centers on pre-baked scorers like ROUGE/BLEU, or off-the-shelf LLM-as-a-judge setups. I think it's fair to say the current crop of frontier models has mostly "solved" basic summarization and QA over a fixed set of docs. Measuring word overlap between a generated summary and the source text doesn't tell us much about whether a response actually helped solve a real problem.
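    To make the word-overlap point concrete, here's a toy ROUGE-1-style recall sketch (the function name and example strings are mine, not from any library): a response with the advice reversed still scores nearly as high as a correct one.

    ```python
    def rouge1_recall(candidate: str, reference: str) -> float:
        """Toy ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
        ref = reference.lower().split()
        cand = set(candidate.lower().split())
        if not ref:
            return 0.0
        return sum(1 for w in ref if w in cand) / len(ref)

    reference = "restart the database service after applying the patch"
    correct   = "after applying the patch restart the database service"
    wrong     = "restart the database service before applying the patch"

    print(rouge1_recall(correct, reference))  # 1.0 -- same words, correct meaning
    print(rouge1_recall(wrong, reference))    # 0.875 -- high score, reversed advice
    ```

    Both responses look nearly identical to an overlap metric, but only one of them would actually help a user.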

    In practice, the root failures we see look different:

      - pulling the wrong documents on the retrieval side
      - partially following formatting instructions
      - drifting across multi-turn interactions
      - responses that look reasonable but lead to user follow-ups because something was missing or off
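    For failure modes like the formatting one, programmatic assertions on structure tend to be more durable than overlap scores. A hypothetical sketch (the function name and required-key scheme are illustrative, not from any eval framework):

    ```python
    import json

    def check_json_list_format(response: str, required_keys: set) -> bool:
        """True if the response parses as a JSON array whose items all carry required_keys."""
        try:
            items = json.loads(response)
        except json.JSONDecodeError:
            return False
        return isinstance(items, list) and all(
            isinstance(it, dict) and required_keys <= it.keys() for it in items
        )

    good = '[{"title": "a", "url": "x"}, {"title": "b", "url": "y"}]'
    bad  = '[{"title": "a"}, {"title": "b", "url": "y"}]'  # one item missing "url"

    print(check_json_list_format(good, {"title", "url"}))  # True
    print(check_json_list_format(bad, {"title", "url"}))   # False
    ```

    A check like this catches "partially followed the formatting instructions" deterministically, with no judge model in the loop.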

    So hopefully the post is a helpful guide on some of the more sustainable eval strategies that have worked nicely for us :)