1. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 7th International Conference on Learning Representations (ICLR). In the abstract, the authors state, "We present the General Language Understanding Evaluation (GLUE) benchmark, a collection of nine natural language understanding tasks... to favor models that share general linguistic knowledge across tasks." (Available at: https://arxiv.org/pdf/1804.07461.pdf, Page 1, Abstract).
2. Manning, C. D. (2022). CS224N: Natural Language Processing with Deep Learning, Lecture 13: Contextual Word Representations. Stanford University. The lecture materials introduce GLUE as a key benchmark for evaluating large pre-trained models like BERT on a "battery of different NLU tasks." (Slide 19, "Evaluating models: Benchmarks").