> While the original motivation for causal masking was not to provide positional information, but instead to have efficient parallelizable training, it turns out that a consistent <bos> token + causal masking is enough to perfectly reconstruct token positions.
I wish this point were explained further rather than left as a footnote. It seems like the central insight the whole technique depends on, and it isn't obvious to me, perhaps because I haven't implemented a transformer from scratch.
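That said, here's a toy sketch of what I think the footnote is getting at (the uniform-attention head is my own simplification for illustration, not the paper's construction): under causal masking, position t only sees positions 0..t, so a head that attends roughly uniformly over its prefix puts weight 1/(t+1) on the shared <bos> token, and that weight alone uniquely identifies t.

```python
import numpy as np

# Toy sketch (my own, not from the paper): with causal masking, token t can only
# attend to positions 0..t. If every sequence starts with the same <bos> token,
# a head that attends uniformly over its visible prefix puts weight 1/(t+1) on
# <bos> -- so the attention output already encodes the position t.

seq_len = 8

# Pretend every non-<bos> token is identical, so any positional signal must come
# from the masking itself, not from token content.
bos_flag = np.zeros(seq_len)
bos_flag[0] = 1.0  # marks the <bos> position

for t in range(seq_len):
    # Uniform causal attention: position t averages over positions 0..t.
    attn_weights = np.ones(t + 1) / (t + 1)
    # The "value" read from each position is just its <bos> indicator.
    bos_mass = attn_weights @ bos_flag[: t + 1]  # equals 1 / (t + 1)
    recovered_position = round(1.0 / bos_mass) - 1
    print(f"t={t}: weight on <bos> = {bos_mass:.3f}, recovered position = {recovered_position}")
```

Running this prints a <bos> weight of 1/(t+1) at each step and recovers t exactly, which (if my reading is right) is why a consistent <bos> plus the causal mask is enough for the model to reconstruct absolute positions without any explicit positional encoding. Happy to be corrected if the actual argument in the paper is different.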