Q: 11
A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to
Spark 3.5 has improved the runtime of some scheduled Spark applications.
Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.
Which operation does AQE perform to automatically improve the Spark application's performance?
Options
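For context, a minimal sketch of the AQE-related settings (the session builder and values are illustrative; AQE is enabled by default from Spark 3.2 onward). At runtime AQE can coalesce small post-shuffle partitions, switch join strategies, and split skewed join partitions.

from pyspark.sql import SparkSession

# Illustrative configuration: these switches control AQE's runtime rewrites.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")                     # AQE master switch
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small post-shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .getOrCreate()
)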
Q: 12
Which Spark configuration controls the number of tasks that can run in parallel on an executor?
Options
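As a rough sketch (the values are examples only): the number of tasks an executor can run concurrently is spark.executor.cores divided by spark.task.cpus.

from pyspark.sql import SparkSession

# Example values: with 4 executor cores and 1 CPU per task,
# each executor can run up to 4 tasks in parallel.
spark = (
    SparkSession.builder
    .config("spark.executor.cores", "4")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)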
Q: 13
Given the code fragment:
import pyspark.pandas as ps
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into
a standard PySpark DataFrame (pyspark.sql.DataFrame)?
Options
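A short sketch building on the question's snippet: to_spark() on a pandas-on-Spark DataFrame returns a standard pyspark.sql.DataFrame.

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# to_spark() converts the pandas-on-Spark frame into a pyspark.sql.DataFrame
sdf = psdf.to_spark()
print(type(sdf))   # <class 'pyspark.sql.dataframe.DataFrame'>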
Q: 14
What is the risk of converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?
Options
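A minimal sketch of the conversion in question: to_pandas() pulls the entire distributed dataset onto the driver, which is where the risk lies for large frames.

import pyspark.pandas as ps

psdf = ps.range(1000)   # small here; imagine billions of rows

# to_pandas() collects every partition to the driver and materializes the
# result in driver memory, so a sufficiently large frame can cause an
# out-of-memory failure on the driver.
pdf = psdf.to_pandas()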
Q: 15
A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.
Which type of join will Adaptive Query Execution (AQE) choose in this case?
Options
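Illustrative settings only (the threshold value is an example): when every post-shuffle partition is smaller than this threshold, AQE can rewrite the planned sort-merge join into a shuffled hash join at runtime.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example values; both settings are regular SQL confs.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")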
Q: 16
What is the benefit of using the Pandas API on Spark for data transformations?
Options
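For illustration, a small sketch (the data is made up): the code looks like pandas, but the work is executed by Spark and can scale beyond a single machine's memory.

import pyspark.pandas as ps

# Familiar pandas-style API, evaluated as distributed Spark jobs.
psdf = ps.DataFrame({'group': ['a', 'a', 'b'], 'value': [1, 2, 3]})
totals = psdf.groupby('group').sum()
print(totals)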
Q: 17
A data engineer wants to write a Spark job that creates a new managed table. If the table already
exists, the job should fail and not modify anything.
Which save mode and method should be used?
Options
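A minimal sketch (the table name and data are illustrative): saveAsTable() creates a managed table, and mode "errorifexists" (alias "error", the default) makes the write fail instead of touching an existing table.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])

# Fails with an error if demo_table already exists; nothing is modified.
df.write.mode('errorifexists').saveAsTable('demo_table')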
Q: 18
What is the benefit of using Pandas API on Spark for data transformations?
Options
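A different angle on the same idea (the CSV path is illustrative): moving from pandas to the Pandas API on Spark is often just an import change, while execution becomes distributed.

# import pandas as pd          # single-machine pandas
import pyspark.pandas as ps    # same style of API, distributed on Spark

df = ps.read_csv('/data/events.csv')   # illustrative path
summary = df.describe()                # runs as Spark jobs under the hood
print(summary)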
Q: 19
Given the schema:
event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING
The goal is to deduplicate records based on event_ts, sensor_id, and metric_value.
Options
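One common approach, sketched under the assumption that the data is already loaded as a DataFrame named df (the source table name is illustrative): dropDuplicates() with an explicit column list ignores ingest_ts and source_file_path when deciding which rows are duplicates.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table('sensor_events')   # illustrative source

# Keep one row per (event_ts, sensor_id, metric_value) combination.
deduped = df.dropDuplicates(['event_ts', 'sensor_id', 'metric_value'])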