1. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association.
Reference Details: Section 3.2, "Job Scheduling," explains: "Our scheduler is similar to Dryad’s... However, it pipelines functions with narrow dependencies into a single stage... For example, a map followed by a filter can be performed in a single pass over the data... It creates a new stage at each wide dependency." This directly states that wide dependencies (shuffles) are the basis for stage creation.
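The pipelining of narrow dependencies that the quoted passage describes can be illustrated with a plain-Python sketch (an analogy only, not Spark's implementation): a map followed by a filter is fused into a single pass over the data, with no intermediate collection materialized between the two operations.

```python
def pipeline(data, map_fn, filter_fn):
    """Apply a map and then a filter in one pass over the data,
    analogous to how Spark fuses narrow dependencies into one stage."""
    for record in data:
        mapped = map_fn(record)      # map step
        if filter_fn(mapped):        # filter step, same traversal
            yield mapped

# One traversal of the input yields the final result directly.
result = list(pipeline(range(10), lambda x: x * 2, lambda x: x > 10))
# result == [12, 14, 16, 18]
```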
2. Huawei HCIP-Big Data Developer V2.0 Training Material.
Reference Details: In the "Spark Core Principle" module, the section on the "DAGScheduler" details the process of stage division. It explicitly states that the DAGScheduler scans the RDD graph backwards from the final RDD, and whenever it encounters a shuffle dependency (wide dependency), it creates a new stage for the RDDs involved in that shuffle.
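The backward scan described above can be sketched as a toy model (hypothetical names, not the DAGScheduler's actual code): each RDD records its parent and whether the dependency on that parent is wide, and the scan cuts a stage boundary at every shuffle dependency it meets.

```python
# Toy model of stage division: walk backwards from the final RDD and
# start a new stage at each wide (shuffle) dependency. Names are
# illustrative, not Spark internals.
class RDD:
    def __init__(self, name, parent=None, wide=False):
        self.name, self.parent, self.wide = name, parent, wide

def divide_stages(final_rdd):
    """Scan backwards from the final RDD; cut a stage boundary
    at every wide dependency, then return stages in execution order."""
    stages, current = [], []
    rdd = final_rdd
    while rdd is not None:
        current.append(rdd.name)
        if rdd.wide:                   # shuffle dependency: close this stage
            stages.append(current[::-1])
            current = []
        rdd = rdd.parent
    stages.append(current[::-1])
    return stages[::-1]

# Lineage: textFile -> map -> reduceByKey (shuffle) -> filter
lines    = RDD("textFile")
mapped   = RDD("map", parent=lines)
reduced  = RDD("reduceByKey", parent=mapped, wide=True)
filtered = RDD("filter", parent=reduced)
print(divide_stages(filtered))
# [['textFile', 'map'], ['reduceByKey', 'filter']]
```

The shuffle splits the lineage into two stages: the upstream stage ends at the shuffle write (textFile, map), and the downstream stage begins at the shuffle read (reduceByKey, filter).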
3. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc.
Reference Details: Chapter 4, "Spark Programming Model," in the section "How Spark Executes Applications," describes the role of the DAG scheduler. It clarifies that the scheduler groups RDDs into stages based on whether a shuffle is required, with shuffle operations marking the boundary between stages.