1. Stanford University, CS224N: NLP with Deep Learning. The Winter 2023 lectures on "Pretraining and Large Language Models" establish that models like BERT and GPT are pre-trained on massive unlabeled text corpora (e.g., Wikipedia, BooksCorpus) to learn general language representations, highlighting the core concept of learning from large-scale data. (Reference: Stanford CS224N, Winter 2023, Lecture 12, Slides 15-25.)
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. The paper introducing GPT-3 explicitly states that the model was trained on a dataset combining Common Crawl, WebText2, Books1, Books2, and Wikipedia, totaling hundreds of billions of tokens. The paper's central thesis is that scaling up the data and model size improves performance, reinforcing the data-driven learning concept. (Reference: Section 2.2, "Training Dataset," Page 8.) DOI: https://doi.org/10.48550/arXiv.2005.14165
3. Zhao, W. X., et al. (2023). A Survey of Large Language Models. This comprehensive survey states, "The development of LLMs is mainly driven by the scaling of data, model size, and computation." It details the pre-training phase, where models learn from "large-scale corpora in a self-supervised manner." (Reference: Section 2.1, "Main-stream LLMs," Page 3.) DOI: https://doi.org/10.48550/arXiv.2303.18223