2 points by dial48 14 hours ago | 1 comment
  • dial48 14 hours ago
    We audited the LoCoMo benchmark (one of the most cited evals for LLM agent memory) and found 99 score-corrupting errors among its 1,540 questions (6.4%). Separately, we probed the LLM judge with adversarially generated wrong answers: it accepted 62.81% of vague-but-topical wrong answers. Some published system scores barely clear that bar. Full audit with methodology, all 99 errors documented, and reproducible scripts.
    • PaulHoule 4 hours ago
      I've worked in IR, and this has been true of TREC datasets from the beginning; it has also been true of visual datasets. The first step to building a world-beating commercial system has been cleaning up the garbage in open evals to raise the achievable accuracy ceiling.
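
The judge-probing idea from the top comment can be sketched in a few lines. This is a hedged illustration, not the authors' actual harness: `naive_judge` is a hypothetical stand-in that, like many lenient LLM judges, accepts any candidate sharing enough topic words with the reference answer, which is exactly the failure mode that lets vague-but-topical wrong answers through. All names, the example question, and the threshold are assumptions for illustration.

```python
def naive_judge(reference: str, candidate: str, threshold: float = 0.5) -> bool:
    """Hypothetical lenient judge: accept `candidate` if it shares enough
    words with `reference` (a lexical-overlap proxy for 'topical')."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    return len(ref & cand) / max(len(ref), 1) >= threshold

def acceptance_rate(reference: str, wrong_answers: list[str]) -> float:
    """Fraction of known-wrong answers the judge nevertheless accepts."""
    accepted = sum(naive_judge(reference, w) for w in wrong_answers)
    return accepted / len(wrong_answers)

# Illustrative data only: one reference answer plus adversarial
# vague-but-topical wrong answers (on-topic wording, wrong facts).
reference = "Caroline adopted the cat in March"
wrong = [
    "Caroline adopted the cat in June",    # wrong month, high word overlap
    "Caroline adopted a dog in March",     # wrong animal, high word overlap
    "The neighbor's dog was adopted later",  # low overlap, gets rejected
]
print(f"{acceptance_rate(reference, wrong):.0%} of wrong answers accepted")
```

A lexical judge like this accepts the first two distractors because they keep most of the reference's words; swapping in a real LLM judge only changes how the acceptance decision is made, not how the acceptance rate is measured.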