One thing I'd love to see: what does "actually works" mean in measurable terms? The engineering is sophisticated, but I'm curious about user-facing impact - did memory injection improve task completion or satisfaction? How often do users invoke /memory forget? What's the false positive rate on extraction?
These systems are hard to evaluate because the failure modes are subtle - the AI "knows" something but uses it awkwardly, or surfaces context that feels intrusive. Would be great to hear what metrics you're tracking to validate that the complexity is paying off.
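To make that concrete, here's roughly the kind of instrumentation I have in mind - a minimal sketch in Python, assuming you keep a small hand-labeled sample of extracted memories and a log of /memory commands. `ExtractedMemory`, `is_useful`, and the log format are all made up for illustration, not anything from your system.

```python
from dataclasses import dataclass


@dataclass
class ExtractedMemory:
    """One item the extraction pipeline decided to remember (hypothetical shape)."""
    text: str
    is_useful: bool  # hand label from a reviewer: was this worth remembering?


def extraction_false_positive_rate(sample: list[ExtractedMemory]) -> float:
    """Share of extracted memories a human reviewer judged not worth keeping."""
    if not sample:
        return 0.0
    false_positives = sum(1 for m in sample if not m.is_useful)
    return false_positives / len(sample)


def forget_rate(command_log: list[str]) -> float:
    """Fraction of /memory invocations that were '/memory forget',
    a rough proxy for how often surfaced memories felt wrong or intrusive."""
    memory_cmds = [c for c in command_log if c.startswith("/memory")]
    if not memory_cmds:
        return 0.0
    forgets = sum(1 for c in memory_cmds if c.startswith("/memory forget"))
    return forgets / len(memory_cmds)


if __name__ == "__main__":
    # Tiny illustrative sample; real labels would come from periodic review.
    sample = [
        ExtractedMemory("prefers tabs over spaces", is_useful=True),
        ExtractedMemory("said 'thanks' at 3pm on Tuesday", is_useful=False),
    ]
    log = ["/memory list", "/memory forget 42", "/memory list"]
    print(f"extraction FPR: {extraction_false_positive_rate(sample):.0%}")
    print(f"forget rate:    {forget_rate(log):.0%}")
```

Even a few hundred labeled extractions would give a ballpark answer to whether the pipeline is pulling in noise, and the forget rate is cheap to track continuously.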
Still trying to find a good solution here.