I agree that there is a fundamental issue with v1-style retrieval, but in my view it's not the scoring formula; it's that similarity search mixes semantically related memories with genuinely useful ones. For example, a memory about "surfing last weekend" and a memory about "wanting to surf one day in Hawaii" will both score high for the question "What outdoor activities do I like?". But for the question "What did I do last weekend?", only one is useful, yet both will appear in the injected context. One way to address this is to introduce additional retrieval dimensions, such as keyword matching (BM25), entity-aware scoring, and temporal signals, and combine them to determine which memories are truly relevant to the user's question. This of course adds cost during ingestion, but in general, async ingestion is underrated: users expect near-instant responses, while ingestion can afford to be slower.
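To make the idea concrete, here's a minimal sketch of that kind of hybrid scoring. Everything here is hypothetical: the weights are arbitrary, the keyword score is a crude term-overlap stand-in for real BM25, and the semantic score is assumed to come from a vector index.

```python
import time

def temporal_score(created_at: float, now: float, half_life_days: float = 30.0) -> float:
    """Recency decay: 1.0 for a brand-new memory, halving every half_life_days."""
    age_days = max(0.0, (now - created_at) / 86400.0)
    return 0.5 ** (age_days / half_life_days)

def keyword_score(query: str, text: str) -> float:
    """Crude stand-in for BM25: fraction of query terms present in the memory text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(semantic: float, query: str, memory: dict, now: float,
                 w_sem: float = 0.5, w_kw: float = 0.3, w_time: float = 0.2) -> float:
    """Weighted blend of semantic similarity, keyword overlap, and recency."""
    return (w_sem * semantic
            + w_kw * keyword_score(query, memory["text"])
            + w_time * temporal_score(memory["created_at"], now))

now = time.time()
memories = [
    {"text": "went surfing last weekend at the beach", "created_at": now - 3 * 86400},
    {"text": "someday I want to surf in Hawaii", "created_at": now - 200 * 86400},
]
# Suppose both memories got the same cosine similarity (0.8) from the vector index,
# which is the failure mode described above.
query = "what did I do last weekend"
ranked = sorted(memories, key=lambda m: hybrid_score(0.8, query, m, now), reverse=True)
print(ranked[0]["text"])  # the recent, keyword-matching memory ranks first
```

With pure similarity the two memories tie, but the keyword and temporal terms break the tie in favor of the memory that actually answers "what did I do last weekend". In practice you'd also want entity-aware scoring and per-query weighting, which this sketch omits.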
If I may ask, have you done any benchmarking on the v3 approach? It would be interesting to see how a v3-style solution handles factual questions versus general preference questions; that's usually a tricky case for memory systems.