1. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association.
Reference Details: Section 3.2, "Job Scheduling," explains: "Our scheduler is similar to Dryad’s... However, it pipelines functions with narrow dependencies into a single stage... For example, a map followed by a filter can be performed in a single pass over the data... It creates a new stage at each wide dependency." This directly states that wide dependencies (shuffles) are the basis for stage creation.
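The pipelining of narrow dependencies that the quoted passage describes can be illustrated with a plain-Python sketch (an analogy only, not Spark's implementation): a map followed by a filter is fused into a single pass over the data, with no intermediate collection materialized between the two operations.

```python
def pipeline(data, map_fn, filter_fn):
    """Apply a map and then a filter in one pass over the data,
    analogous to how Spark fuses narrow dependencies into one stage."""
    for record in data:
        mapped = map_fn(record)      # map step
        if filter_fn(mapped):        # filter step, same traversal
            yield mapped

# One traversal of the input yields the final result directly.
result = list(pipeline(range(10), lambda x: x * 2, lambda x: x > 10))
# result == [12, 14, 16, 18]
```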
2. Huawei HCIP-Big Data Developer V2.0 Training Material.
Reference Details: In the "Spark Core Principle" module, the section on the "DAGScheduler" details the process of stage division. It explicitly states that the DAGScheduler scans the RDD graph backwards from the final RDD, and whenever it encounters a shuffle dependency (wide dependency), it creates a new stage for the RDDs involved in that shuffle.
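The backward scan described above can be sketched as a toy model (hypothetical names, not the DAGScheduler's actual code): each RDD records its parent and whether the dependency on that parent is wide, and the scan cuts a stage boundary at every shuffle dependency it meets.

```python
# Toy model of stage division: walk backwards from the final RDD and
# start a new stage at each wide (shuffle) dependency. Names are
# illustrative, not Spark internals.
class RDD:
    def __init__(self, name, parent=None, wide=False):
        self.name, self.parent, self.wide = name, parent, wide

def divide_stages(final_rdd):
    """Scan backwards from the final RDD; cut a stage boundary
    at every wide dependency, then return stages in execution order."""
    stages, current = [], []
    rdd = final_rdd
    while rdd is not None:
        current.append(rdd.name)
        if rdd.wide:                   # shuffle dependency: close this stage
            stages.append(current[::-1])
            current = []
        rdd = rdd.parent
    stages.append(current[::-1])
    return stages[::-1]

# Lineage: textFile -> map -> reduceByKey (shuffle) -> filter
lines    = RDD("textFile")
mapped   = RDD("map", parent=lines)
reduced  = RDD("reduceByKey", parent=mapped, wide=True)
filtered = RDD("filter", parent=reduced)
print(divide_stages(filtered))
# [['textFile', 'map'], ['reduceByKey', 'filter']]
```

The shuffle splits the lineage into two stages: the upstream stage ends at the shuffle write (textFile, map), and the downstream stage begins at the shuffle read (reduceByKey, filter).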
3. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc.
Reference Details: Chapter 4, "Spark Programming Model," in the section "How Spark Executes Applications," describes the role of the DAG scheduler. It clarifies that the scheduler groups RDDs into stages based on whether a shuffle is required, with shuffle operations marking the boundary between stages.