2 points by alex_petrov 4 hours ago | 2 comments
  • alex_petrov 4 hours ago
    66.88%. 80.1%. 85%. 90.79%. 93%. 100%.

    These are all SOTA scores on agentic memory benchmarks. None of them tell you whether the system will work in production.

    The deeper problem isn't the data — it's that we often misunderstand what these numbers actually measure. In our recent white paper we open-sourced datasets that target specific memory functions. Today we published a follow-up that explains why we think the well-known agentic memory benchmarks (LoCoMo, LongMemEval) miss the mark for production systems, and what we measure instead.

    We're in a field that is measuring itself against itself. The real question isn't 'are we beating last week's leaderboard?' — it's 'are we building something that makes people's work meaningfully better?' That's harder to measure. It's also the only thing that matters.

  • norikaoda 3 hours ago
    [flagged]