1. Apache Spark 3.5.1 Official Documentation
pyspark.sql.DataFrame.dropDuplicates:
Reference: "Return a new DataFrame with duplicate rows removed
optionally only considering certain columns. For a static batch DataFrame
it keeps the first row for each set of duplicates." The function signature is DataFrame.dropDuplicates(subset=None)
where subset is an "optional list of column names to consider."
Location: Apache Spark API Docs > pyspark.sql > DataFrame API > pyspark.sql.DataFrame.dropDuplicates.
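A minimal runnable sketch of the documented signature; the session name, sample rows, and column names below are illustrative assumptions, not taken from the docs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Illustrative data: two fully identical rows plus one partial match.
df = spark.createDataFrame(
    [("Alice", 5, 80), ("Alice", 5, 80), ("Alice", 10, 80)],
    ["name", "age", "height"],
)

# subset=None (the default): all columns are considered.
df.dropDuplicates().show()

# subset as an optional list of column names: duplicates are judged
# on 'name' only; per the docs, for a static batch DataFrame the
# first row of each duplicate set is kept.
df.dropDuplicates(subset=["name"]).show()
```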
2. Databricks Official Documentation
"Duplicate records":
Reference: "You can use dropDuplicates to remove duplicate rows from a DataFrame
optionally considering only a subset of columns... The following code drops duplicate rows from a DataFrame
considering only the columns 'name' and 'gender'."
Location: Databricks Documentation > Apache Spark > DataFrames > Transformations > Duplicate records.
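The Databricks page's exact snippet is not reproduced above, so the following is a hedged reconstruction of what deduplicating on the 'name' and 'gender' columns would look like; the rows are invented, and the spark session from the sketch under entry 1 is reused:

```python
# Sample rows are made up for illustration only.
people = spark.createDataFrame(
    [("Ravi", "M", 28), ("Ravi", "M", 35), ("Mira", "F", 41)],
    ["name", "gender", "age"],
)

# Considering only 'name' and 'gender': the second ("Ravi", "M")
# row is dropped even though its age differs.
people.dropDuplicates(["name", "gender"]).show()
```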
3. University of California, Berkeley - CS 194 Data Science Courseware:
Reference: Lecture materials on Spark DataFrames often explain that df.dropDuplicates(['col1', 'col2']) is the standard method for removing duplicate records based on a subset of columns, contrasting it with df.distinct(), which operates on all columns.
Location: Found in typical course materials for data engineering with Spark, such as UC Berkeley's Data Science curriculum resources.
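To make the contrast the course materials draw concrete, here is a small sketch; the column names col1/col2/col3 and the values are placeholders, and the spark session from entry 1 is reused:

```python
# The two rows agree on col1 and col2 but differ on col3.
rows = spark.createDataFrame(
    [(1, "a", "x"), (1, "a", "y")],
    ["col1", "col2", "col3"],
)

# distinct() compares entire rows, so both rows survive.
rows.distinct().show()

# dropDuplicates(['col1', 'col2']) compares only the subset,
# so one of the two rows is removed.
rows.dropDuplicates(["col1", "col2"]).show()
```

Either call returns a new DataFrame; neither mutates the original in place.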