The Role of Anchor Tokens in Self-Attention Networks(arxiv.org)

18 pointsby smookea year ago2 comments

zoppera year ago
Surprised this isn't getting more attention. It is one of those papers that is very elegant and simple, yet very effective.
- forrestpa year ago
  It's expensive in this field to verify other people's work. There are a few other papers in the last 3 years that have the same high-level idea but call the anchor tokens something different -- Gist tokens being the only one I personally remember, but you can follow the citation chains back.
  Those other papers sounded like a godsend but have deficits that you only find out about if you try to use them against non-cherry-picked use-cases. I think they are on average getting better though with time.
  They call out their limitations in the bottom of the paper. For these kinds of models, it would be nice to see them exploiting & measuring the weaknesses of compressive memory -> producing exact outputs. This would be things retrieving multiple things out of context exactly, arithmetic, or copy-pasting high-entropy bits (e.g. where a basic n-gram model can't bias you out of the blurry pieces).
  The other side of it is there is often some difficulty in reproducing training for some of these architectures -- the training can be highly unstable and both difficult + expensive to dial-in on a real-world model. We see their best training run, not their 500 runs where they changed hyperparameters b/c the loss kept exploding randomly (compare this to text-only llama-esque architectures where they are wildly stable at training time / predictable / easy to invest into and hyperparams are easy to find from prior art).
  I think we are still many papers away from something ready-for-prod on this concept, but I am personally optimistic.
wantsanagenta year ago
Someone explain to me how this isn't reinventing LSTMs please.
- toxika year ago
  I don’t understand why you think they are even similar. This is still doing pairwise attention.
  - wantsanagenta year ago
    An LSTM takes a series of values and uses a combination of gates to determine critical information to hold on to or forget as a sequence unfolds. This is a compressive technique that removes the requirement of having all previous sequence information at the time of a particular inference.
    This paper "compress sequence information into an anchor token" which is then used at inference time to reduce the information required for prediction as well as speed up that prediction. They do this via "continually pre-training the model to compress sequence information into the anchor token."