1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186). https://doi.org/10.18653/v1/N19-1423. The paper's abstract and introduction define BERT as a model for pre-training deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
2. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781. This paper introduces the Word2vec model, describing its two architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, for learning high-quality word vectors from large text corpora.
3. Manning, C., & Jurafsky, D. (2023). Speech and Language Processing (3rd ed. draft). Stanford University. Chapter 6, "Vector Semantics and Embeddings," details static embeddings like Word2vec. Chapter 11, "Self-Attention and Transformers," covers the architecture underlying models like BERT.
4. Stanford University, CS224N: Natural Language Processing with Deep Learning (Winter 2021), lecture series. Lecture 2, "Word Vectors and Word Senses," details the motivation and mechanics of the Word2vec model for creating word embeddings; Lecture 14, "Contextual Word Embeddings," covers BERT as a primary example of models that produce contextual representations (the static-versus-contextual distinction these sources draw is illustrated in the code sketch following this list).
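To make the static-versus-contextual contrast concrete, below is a minimal illustrative sketch that is not taken from any of the cited works. It compares a Word2vec vector, which is fixed per word type, with BERT's contextual vectors for the ambiguous word "bank." It assumes the third-party Python packages gensim, torch, and transformers are installed and that the bert-base-uncased checkpoint can be downloaded; the toy corpus and example sentences are invented for illustration.

```python
# Illustrative sketch (not from the cited sources): contrasting a static
# Word2vec embedding with BERT's contextual embeddings for the word "bank".
# Assumes gensim, torch, and transformers are installed and that the
# "bert-base-uncased" checkpoint is available for download.

from gensim.models import Word2Vec
import torch
from transformers import AutoTokenizer, AutoModel

# --- Static embeddings (Word2vec, skip-gram) --------------------------------
# Tiny invented corpus, purely for illustration.
corpus = [
    ["she", "sat", "by", "the", "river", "bank"],
    ["he", "deposited", "cash", "at", "the", "bank"],
]
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
# One vector per word type: "bank" gets the same vector in both sentences.
print("Word2vec 'bank' vector (first 5 dims):", w2v.wv["bank"][:5])

# --- Contextual embeddings (BERT) --------------------------------------------
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return BERT's final-layer hidden state for the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("She sat by the river bank.")
v_money = bank_vector("He deposited cash at the bank.")

# The two "bank" vectors differ because BERT conditions on the full sentence.
cosine = torch.nn.functional.cosine_similarity(v_river, v_money, dim=0)
print("Cosine similarity of the two contextual 'bank' vectors:", cosine.item())
```

Running this should print an identical Word2vec vector for "bank" regardless of context, while the two BERT vectors differ (cosine similarity noticeably below 1), which is the contextual-representation property the BERT paper and the later course lectures emphasize.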