2 points by alex_petrov 4 hours ago | 2 comments

alex_petrov 4 hours ago
66.88%. 80.1%. 85%. 90.79%. 93%. 100%.
These are all SOTA scores on agentic memory benchmarks. None of them tell you whether the system will work in production.
The deeper problem isn't the data; it's that we often misunderstand what these numbers actually measure. In our recent white paper, we open-sourced datasets that target specific memory functions. Today we published a follow-up explaining why we think the well-known agentic memory benchmarks (LoCoMo, LongMemEval) miss the mark for production systems, and what we measure instead.
We're in a field that is measuring itself against itself. The real question isn't 'are we beating last week's leaderboard?' — it's 'are we building something that makes people's work meaningfully better?' That's harder to measure. It's also the only thing that matters.
norikaoda 3 hours ago
[flagged]