1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017). Section 3.5, "Positional Encoding," Paragraph 1: "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence."
2. Stanford University. (2023). CS224n: Natural Language Processing with Deep Learning, Lecture 9: Self-Attention and Transformers. Section: "A detail: Positional Encoding": "The self-attention layer itself is permutation-equivariant... To make the model order-aware, we need to input the position of each word. We do this with positional encodings."
3. Alammar, J. (n.d.). The Illustrated Transformer. jalammar.github.io. Section: "Positional Encoding": "One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence... To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence." (Note: While a blog post, this resource is widely cited in university courseware, including Stanford's, for its accurate and clear explanation of the foundational paper. A sketch of the sinusoidal pattern defined in reference 1 follows this list.)
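To make the "specific pattern" in these quotes concrete, below is a minimal NumPy sketch of the fixed sinusoidal encoding defined in Section 3.5 of reference 1, where PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the choice of NumPy are illustrative assumptions, not taken from any of the cited sources.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal positional encodings per Vaswani et al. (2017), Sec. 3.5.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]            # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # shape (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# As reference 3 describes, the encoding is added to each input embedding:
#   inputs = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Note that the vectors in reference 1 are fixed sinusoids rather than learned parameters; learned positional embeddings are a common alternative that the original paper reports as performing nearly identically.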