In summary: without positional confounds, Transformers are a powerhouse at retrieval. Length generalization is effortless. At head dimension 2 or above, head dimension does not limit retrieval capacity at all. Retrieval is geometry-driven and rests on three mechanisms: separation (of the hidden-state geometry into a dense spherical code), projection (of the code out of the hidden state), and amplification (to sharpen/saturate the softmax).
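Here's a toy numpy sketch of the separation/amplification story (my own illustration, not the paper's code; the dimensions, scale, and random code are all assumptions): keys drawn at random on the sphere act as a dense code with small pairwise overlaps, and a large logit scale is enough to saturate softmax into a near-one-hot retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512  # head dimension, number of stored items (n >> d)

# Separation: random unit vectors form a dense spherical code.
# Far more than d of them coexist, nearly orthogonal in expectation.
keys = rng.standard_normal((n, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

target = 42
query = keys[target]  # idealized projection: the query aligns with one key

# Amplification: a large logit scale sharpens softmax toward a one-hot
# retrieval, even though off-target dot products are nonzero.
scale = 30.0
logits = scale * (keys @ query)
attn = np.exp(logits - logits.max())
attn /= attn.sum()

print(attn.argmax(), attn.max())  # target index, attention mass near 1
```

Dropping `scale` back to 1.0 leaves the argmax correct but the softmax far from saturated, which is why amplification shows up as a distinct mechanism here.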
Some other fun implications:
- Models can represent features in dense spherical codes, not just orthogonal axes or superpositions.
- Retrieval heads appear to cripple their own gradients upon formation.
- Mainstream positional encodings aren't designed with retrieval in mind and are actively antagonistic to it. Follow-up experiments hint that simply including a PE is catastrophic for retrieval.
- Length generalization failures are likely mostly PEs warping the learned code, so that separations become alignments and alignments become separations.
- "Out-of-distribution" can be seen as "never accounted for in the spherical code". If it hasn't been seen it cannot be separated, and if it hasn't been separated it cannot be distinguished.
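To make the "dense spherical code, not just orthogonal axes" point concrete, here's a quick check (again my own toy, with arbitrary sizes): far more than d random unit vectors can coexist in d dimensions while every pairwise overlap stays well away from 1, so the code stays separable.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 128, 1000  # many more code vectors than dimensions

# Sample n random unit vectors in d dimensions.
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Largest off-diagonal |cosine| across all pairs.
gram = vecs @ vecs.T
np.fill_diagonal(gram, 0.0)
max_overlap = np.abs(gram).max()

print(max_overlap)  # bounded well below 1 despite n >> d
```

An orthogonal basis caps out at d features; a spherical code trades exact orthogonality for capacity, which is also why anything never placed into the code ("out-of-distribution") has no separation to exploit.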
Preprint here: https://zenodo.org/records/19359748 (Still fishing for an arXiv endorsement...)
Github repo here: https://github.com/tmaselko/paper-attncap
You can replicate the headline results in five minutes on a 4090, or the whole paper in 20-30 hours if so inclined.
I'd be happy to answer any questions; I'm kinda starved for feedback on this.