The memory works by tracking two kinds of "surprise" - immediate surprise (how unexpected is the current token?) and accumulated surprise (what patterns of unexpected things have we been seeing?). It uses this to decide what's worth remembering and what can be forgotten. What's clever is they formulated this as a gradient descent problem that can run efficiently in parallel despite being inherently sequential.
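To make that concrete, here is how I read the update rule, boiled down to a purely linear memory in numpy (the paper uses an MLP for the memory; the names lr, momentum and forget are mine, standing in for the paper's learning-rate, momentum and forget gates):

    import numpy as np

    def memory_step(M, S, k, v, lr=0.1, momentum=0.9, forget=0.01):
        # "Immediate surprise": gradient of the associative loss ||M k - v||^2
        # for the current token (linear memory here; the paper uses an MLP).
        grad = 2.0 * np.outer(M @ k - v, k)
        # "Accumulated surprise": momentum over past gradients.
        S = momentum * S - lr * grad
        # Forget gate decays the old memory before folding in the new surprise.
        M = (1.0 - forget) * M + S
        return M, S

Just a sketch under my reading of the paper, not their implementation, but it shows why the whole thing is "just" online gradient descent with gates.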
The really interesting part is how it integrates with the main model: they tried three approaches, and the most effective was using the memory as additional context tokens alongside the input. This lets the attention mechanism figure out for itself when to use the memory versus the immediate context. And because the memory tokens are injected both during training and at test time, the memory module and the main model are trained together, even though the memory module's weights keep being updated during each inference.
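In pseudo-numpy, the "memory as context" idea I'm picturing is roughly the following (my names, single head, no causal mask, and the retrieval/update of the memory itself left out):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def segment_forward(segment, mem_tokens, persistent, Wq, Wk, Wv):
        # "Memory as context": retrieved memory tokens and learned persistent
        # tokens are simply prepended to the current segment, so the attention
        # weights decide when to rely on memory vs. the immediate context.
        tokens = np.concatenate([persistent, mem_tokens, segment], axis=0)
        Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        return A @ V

Here mem_tokens would come from querying the neural memory with the current segment; the memory is then updated on that segment afterwards.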
In practice, this lets them handle sequences over 2M tokens long while outperforming traditional transformers, even matching GPT-4 on some long-context reasoning tasks with far fewer parameters. It's a neat example of combining classical ideas about online learning with modern deep learning architectures.
The code isn't released yet, but the paper suggests the implementation is relatively straightforward since it builds on standard gradient descent mechanics. It'll be interesting to see if this approach influences the next generation of open-source LLMs. I'm sure we will see implementations very soon, even though it may take some time for open-source models to be trained with this new architecture.
I'm very excited to know whether Gemini 2.0 1206 Experimental is using this new architecture. I suspect it is.
After reading it a few times, I gather that, rather than kernelizing or linearizing attention (which has been thoroughly explored in the literature), they are using an MLP to do run-time modelling of the attention operation. If that's the case (which is interesting, sure): 1 -- Why didn't they say this plainly? 2 -- Why does eq. 12 show the memory MLP being indexed by the key, whereas eq. 15 shows it indexed by the query? 3 -- What's with all the extra LSTM-esque forget and remember gates? Meh. Wouldn't trust it without ablations.
I guess if an MLP can model a radiance field (NeRF) well, it stands to reason it can approximate attention too. The Q, K, V projection matrices will still need to be learned beforehand with standard training.
While the memory & compute savings are clear, it's uncertain whether this helps with reasoning or generalization thereof. I doubt that too.
Eq. 15 is simply the operation that queries a value previously inserted by earlier tokens via eq. 12.
Basically, for each autoregressively processed segment you do:
1) Test-time inference: query values from memory with eq. 15.
2) Test-time training: associate new keys and values into the memory with the loss from eq. 12.
The forget and remember gates are there because... well, the architecture in general is very similar to an LSTM, but it uses test-time gradient descent to decide what to insert into the long-term memory (rough sketch below).
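Putting 1) and 2) together, per segment it looks roughly like this (linear memory again for simplicity, with an update in the style of the memory_step sketch above passed in as step; names and shapes are mine, not the paper's):

    import numpy as np

    def run_segment(M, S, X, Wq, Wk, Wv, step):
        # 1) Test-time inference (eq. 15): read from memory using the queries.
        reads = np.stack([M @ (x @ Wq) for x in X])
        # 2) Test-time training (eq. 12): write this segment's (key, value)
        #    pairs into the memory by gradient descent on ||M(k) - v||^2.
        for x in X:
            M, S = step(M, S, x @ Wk, x @ Wv)
        return reads, M, S

So the memory is trained on keys but read with queries, which is where the question below about the implicit assumption comes in.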
The implicit assumption, then, seems to be that M(q) -> v "looks like" or "is smooth like" the dot product; otherwise "train on keys, run inference on queries" wouldn't work? (A safe assumption imo given the l2 norm and in general; unsafe if q and k come from different distributions.)
Correct me if I'm wrong, but typically k and v are generated via affine projections K and V of the tokens; if M is matrix-valued and there are no forget and remember gates (to somehow approximate the softmax?), then M = V K^-1.
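A quick numerical sanity check of that closed form for the purely linear, gate-free case (toy dimensions; it assumes the key projection K is square and invertible, otherwise you'd get the least-squares/pseudoinverse solution instead):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    K = rng.normal(size=(d, d))   # key projection
    V = rng.normal(size=(d, d))   # value projection
    M = V @ np.linalg.inv(K)      # the claimed closed form M = V K^-1

    x = rng.normal(size=d)        # any token embedding
    k, v = K @ x, V @ x
    print(np.allclose(M @ k, v))  # True: M maps every k to its v exactly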
And I tried to unpack it a bit here: https://wdmn.fr/rank-1-take-on-rwkv7s-in-context-learning/
They have several other "We took the best of both worlds" type papers.
1. The key data point seems to be Figure 6a, where it compares performance on BABILong and claims Titans reaches ~62%, compared to GPT-4o-mini at ~42%, at 100k sequence length.
However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?
2. There is no example provided of the Neural Memory Module in action. This is the first question I would ask of this paper.
We just take it for granted.
"Attention Is All You Need" ring any bells?