39 points by limoce a year ago | 2 comments

kevmo314 a year ago
This paper seems like it misses the forest for the trees. The analysis is certainly interesting and the proposal sounds viable, sort of like a sliding window attention with a little more history.
But if it is true that the separators contribute the most towards the attention scores, wouldn't that imply that the tokenization scheme can be improved? Introducing a compression scheme seems like patching around that, compared to having the model naturally generate a more random attention distribution.
xp84 a year ago
Or, put another way:
"Why waste time say lot token when few token do trick?"
-Kevin Malone