1. University Courseware:
Karpathy, A. (n.d.). CS231n: Convolutional Neural Networks for Visual Recognition. Stanford University. In Module 1, "Neural Networks Part 2: Setting up the Data and the Loss," the course notes emphasize data preprocessing and sanity checks on the data as the very first step. The notes state, "The first step is to look at your data... Before you write a single line of code, you should know what your data is like." This directly aligns with the goal of EDA: uncovering patterns and anomalies before any modeling.
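The "look at your data before writing model code" advice can be sketched as a minimal sanity-check pass. This is a hedged, stdlib-only illustration, not code from the course notes; the `label` field name and CSV layout are hypothetical:

```python
import csv
from collections import Counter

def sanity_check(path, label_field="label"):
    """Summarize a labeled CSV before any modeling: row count,
    class balance, and rows containing empty fields.
    (Illustrative sketch; field names are assumptions.)"""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    labels = Counter(r[label_field] for r in rows)
    n_empty = sum(1 for r in rows if any(v == "" for v in r.values()))
    return {
        "rows": len(rows),
        "classes": dict(labels),
        "rows_with_empty_fields": n_empty,
    }
```

Running this on a new dataset immediately surfaces class imbalance and missing values, the kinds of anomalies EDA is meant to catch first.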
2. Official Vendor Documentation:
NVIDIA. (2023). NeMo Framework Documentation: Data Curation with NeMo Curator. The NeMo Curator tool is designed for preparing large-scale datasets for LLMs. Its capabilities, such as quality filtering (e.g., detecting "bad" text), deduplication, and bias analysis, are all processes informed by, or part of, a comprehensive EDA. The documentation highlights the need to analyze and filter data to improve model performance, which is the core objective of applying EDA.
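One of the operations the NeMo Curator documentation describes, exact-match deduplication, can be sketched in plain Python. This is a stdlib illustration of the general technique, not the NeMo Curator API:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates (after lowercasing and collapsing
    whitespace), keeping the first occurrence of each document.
    Sketch of exact-match dedup; production tools like NeMo Curator
    also do fuzzy/near-duplicate detection."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Hashing the normalized text keeps memory proportional to the number of unique documents rather than their total size, which matters at LLM-pretraining scale.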
3. Peer-reviewed Academic Publications:
Zha, D., et al. (2023). Data-centric artificial intelligence: A survey. arXiv preprint arXiv:2303.10158. Section 3, "The Data-Centric AI Pipeline," describes the initial stage as "Data Understanding," which corresponds to EDA. The authors state that this step involves analyzing data properties, distributions, and potential issues such as biases and noise to inform subsequent data improvement strategies. (DOI: https://doi.org/10.48550/arXiv.2303.10158)
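The "analyzing data properties and distributions" step the survey describes can be illustrated with a small summary-statistics helper over any numeric property (e.g., document lengths). A minimal sketch using only the Python standard library; the property being summarized is an assumption for illustration:

```python
import statistics

def describe(values):
    """Summary statistics for one numeric data property,
    e.g. token counts per document. (Illustrative sketch.)"""
    values = sorted(values)
    return {
        "n": len(values),
        "min": values[0],
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": values[-1],
        "stdev": statistics.pstdev(values),
    }
```

A mean far above the median, or an extreme max, flags the skew and outliers that the survey's "Data Understanding" stage is meant to surface before data cleaning begins.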