Most AI systems today treat memory as ephemeral. Context is fetched from a remote store, used briefly, and discarded. Persistence is something you layer on later, usually via a network call to a database that sits outside the reasoning loop. This model works reasonably well when interactions are short-lived and connectivity is assumed.
It starts to feel fragile when systems are expected to run continuously, survive restarts, operate offline, or reason repeatedly over long histories. In those cases, memory access becomes part of the critical path rather than a background concern.
What struck me while working on this is that many performance and cost problems people attribute to “scale” are really consequences of where memory lives. If every recall requires a network hop, then latency, reliability, and cost are inherently coupled to usage. You can hide that with caching and batching, but the constraint never goes away.
We’ve been exploring an alternative approach where persistence is treated as part of the hot path instead of something bolted on. Memory lives locally alongside the application, survives restarts by default, and is accessed at hardware speed. Once retrieval stops leaving the machine, a few second-order effects emerged that surprised us. Cost stops scaling with traffic. Recovery stops being an operational event. Systems behave the same whether they are online, offline, or at the edge.
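To make that concrete, here is a minimal sketch of the shape I have in mind. It is only an illustration, not our actual implementation: agent memory backed by an embedded SQLite file that lives next to the process, so every recall is a local read and the state is still there after a restart. The class, table layout, and names are made up for the example.

    import json
    import sqlite3
    import time

    class LocalMemory:
        """Memory that lives alongside the application in a local file."""

        def __init__(self, path="agent_memory.db"):
            # Opening the same file after a restart recovers all prior
            # memories with no network call and no separate recovery step.
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS memories (ts REAL, key TEXT, value TEXT)"
            )

        def remember(self, key, value):
            self.db.execute(
                "INSERT INTO memories VALUES (?, ?, ?)",
                (time.time(), key, json.dumps(value)),
            )
            self.db.commit()  # durable before returning to the reasoning loop

        def recall(self, key, limit=10):
            # Retrieval is a local read: latency is bounded by disk and the
            # page cache, not by a round trip to a remote store.
            rows = self.db.execute(
                "SELECT value FROM memories WHERE key = ? ORDER BY ts DESC LIMIT ?",
                (key, limit),
            ).fetchall()
            return [json.loads(v) for (v,) in rows]

    memory = LocalMemory()
    memory.remember("user_preference", {"units": "metric"})
    print(memory.recall("user_preference"))

The point of the sketch is only that nothing in the recall path depends on connectivity, which is why online, offline, and edge behave the same.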
I’m very early in this commercially and building it with a co-founder, but before locking in assumptions I wanted to sanity-check the architectural framing with people here. Does this line up with how others see AI systems evolving, or do you think the current model of ephemeral memory plus remote persistence is still the right long-term abstraction?
I’ve documented the architecture and trade-offs of what we’ve built so far here, for anyone who wants concrete details.
I’m much more interested in the discussion than the implementation itself.