Databricks Certified Associate Developer for Apache Spark
Q: 1
A Spark application is experiencing performance issues in client mode because the driver is resource-
constrained.
How should this issue be resolved?
Options
Q: 2
How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local
Mode for testing?
Options
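For reference, a minimal sketch (an illustration, not the answer key) of configuring a session for local-mode testing; local[*] asks Spark to use every core on the local machine, and the app name is a placeholder:

from pyspark.sql import SparkSession

# Run the driver and executors inside a single local JVM, using all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-test")   # placeholder name
         .getOrCreate())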
Q: 3
An MLOps engineer is building a Pandas UDF that applies a language model that translates English
strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting
the performance of the data pipeline.
The initial code is:
def in_spanish_inner(df: pd.Series) -> pd.Series:
model = get_translation_model(target_lang='es')
return df.apply(model)
in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is
loaded?
Options
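For study purposes, a hedged sketch of the iterator-of-Series Pandas UDF pattern, which loads the model once per task rather than once per batch; get_translation_model comes from the question and is assumed to return a callable:

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once per task, then reused for every batch the task processes.
    model = get_translation_model(target_lang='es')
    for batch in batches:
        yield batch.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())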
Q: 4
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without
changing their existing application code? (Choose 2 answers)
Options
Q: 5
A data scientist has identified that some records in the user profile table contain null values in one or more of the fields, and such records should be removed from the dataset before processing. The schema includes fields such as user_id, username, date_of_birth, and created_ts.
Which block of Spark code can be used to achieve this requirement?
Options
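For context, a hedged sketch of one way to drop rows containing a null in any column (df here stands for the user profile DataFrame loaded elsewhere):

# Drop every row that has a null in any field; how="any" is the default.
cleaned_df = df.na.drop(how="any")
# Equivalent spelling:
cleaned_df = df.dropna(how="any")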
Q: 6
A data engineer is reviewing a Spark application that applies several transformations to a DataFrame
but notices that the job does not start executing immediately.
Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2
answers)
Options
Q: 7
What is the relationship between jobs, stages, and tasks during execution in Apache Spark?
Options
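As a study aid, a small sketch showing how the pieces relate: an action submits a job, a shuffle boundary splits the job into stages, and each stage runs one task per partition:

df = spark.range(0, 1_000_000, numPartitions=8)          # 8 partitions -> 8 tasks in the first stage
agg = df.groupBy((df.id % 10).alias("bucket")).count()   # the shuffle introduces a stage boundary
agg.collect()                                            # action: submits the job that runs the stages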
Q: 8
Which command overwrites an existing JSON file when writing a DataFrame?
Options
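For reference, a hedged sketch of overwriting a JSON target with the DataFrameWriter; the output path is a placeholder:

# mode("overwrite") replaces any existing data at the target path.
df.write.mode("overwrite").json("/tmp/output/events_json")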
Q: 9
A data engineer is working on a real-time analytics pipeline using Apache Spark Structured
Streaming. The engineer wants to process incoming data and ensure that triggers control when the
query is executed. The system needs to process data in micro-batches with a fixed interval of 5
seconds.
Which code snippet could the data engineer use to fulfil this requirement?
A)
B)
C)
D)
Options
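For context, a minimal sketch (not one of the lettered options, which are not reproduced here) of a micro-batch trigger with a fixed 5-second interval; df stands for a streaming DataFrame defined earlier, and the console sink is a placeholder:

query = (df.writeStream
         .format("console")                     # placeholder sink
         .trigger(processingTime="5 seconds")   # micro-batch fired every 5 seconds
         .start())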
Q: 10
Given the code:
from pyspark.sql.functions import col, split, lit

df = spark.read.csv("large_dataset.csv")
filtered_df = df.filter(col("error_column").contains("error"))
mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"),
                               lit(1).alias("count"))
reduced_df = mapped_df.groupBy("date").sum("count")
reduced_df.count()
reduced_df.show()
At which point will Spark actually begin processing the data?
Options
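As background, a tiny sketch of the execution model the question relies on: transformations only build the logical plan, and nothing runs until an action is called:

df = spark.range(10)                        # no job yet
doubled = df.selectExpr("id * 2 AS value")  # transformation: still lazy
doubled.count()                             # action: a Spark job actually runs here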
Q: 11
A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to
Spark 3.5 has improved the runtime of some scheduled Spark applications.
Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.
Which operation is AQE performing to automatically improve the Spark application's performance?
Options
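For reference, a hedged sketch of the AQE-related settings involved (AQE is enabled by default from Spark 3.2 onward); these are real configuration keys, shown only to illustrate where the behavior is controlled:

spark.conf.set("spark.sql.adaptive.enabled", "true")                      # turn AQE on
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")             # split skewed join partitions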
Q: 12
Which Spark configuration controls the number of tasks that can run in parallel on an executor?
Options
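As a study note, a hedged sketch of the two settings that together bound per-executor task parallelism (the values are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.cores", "4")  # cores each executor gets
         .config("spark.task.cpus", "1")       # cores each task claims
         .getOrCreate())
# Concurrent tasks per executor is roughly spark.executor.cores / spark.task.cpus.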
Q: 13
Given the code fragment:
import pyspark.pandas as ps
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into
a standard PySpark DataFrame (pyspark.sql.DataFrame)?
Options
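For reference, a hedged sketch of converting between the two DataFrame flavors with the pandas-on-Spark API (pandas_api() requires Spark 3.2 or later):

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
sdf = psdf.to_spark()      # pyspark.pandas.DataFrame -> pyspark.sql.DataFrame
psdf2 = sdf.pandas_api()   # and back again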
Q: 14
What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?
Options
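For context, a hedged sketch of the conversion the question refers to; psdf stands for a pandas-on-Spark DataFrame, and to_pandas() pulls every row back to the driver, so a very large dataset can exhaust driver memory:

# Collects the full distributed dataset into a single in-memory pandas DataFrame on the driver.
pdf = psdf.to_pandas()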
Q: 15
A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.
Which type of join will Adaptive Query Execution (AQE) choose in this case?
Options
Q: 16
What is the benefit of using Pandas on Spark for data transformations?
Options
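As a study aid, a small sketch of the pandas API on Spark: familiar pandas-style syntax, with the work executed by Spark across the cluster:

import pyspark.pandas as ps

psdf = ps.range(1000)              # distributed DataFrame with a pandas-like interface
psdf["doubled"] = psdf["id"] * 2   # pandas syntax, Spark execution
print(psdf.head())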
Q: 17
A data engineer wants to write a Spark job that creates a new managed table. If the table already
exists, the job should fail and not modify anything.
Which save mode and method should be used?
Options
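For reference, a hedged sketch of creating a managed table with a save mode that fails if the table already exists; the table name is a placeholder:

# "errorifexists" (also spelled "error") raises if the table is already present and modifies nothing.
df.write.mode("errorifexists").saveAsTable("analytics.user_profiles")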
Q: 18
What is the benefit of using Pandas API on Spark for data transformations?
Options
Q: 19
Given the schema:
event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING
The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.
Options
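For context, a hedged sketch of deduplicating on a subset of columns while keeping the remaining fields in the output; df stands for a DataFrame with the schema above:

deduped_df = df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])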
Q: 20
Which UDF implementation calculates the length of strings in a Spark DataFrame?
Options
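As a study aid, a hedged sketch of a simple Python UDF that returns string lengths; the column name is a placeholder:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# None-safe length UDF; in practice the built-in length() function is usually preferable.
string_length = udf(lambda s: len(s) if s is not None else None, IntegerType())
df_with_len = df.withColumn("name_length", string_length(col("name")))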
Q: 21
A data scientist wants to ingest a directory full of plain text files so that each record in the
output DataFrame contains the entire contents of a single file and the full path of the file the text was
read from.
The first attempt does read the text files, but each record contains a single line. This code is shown
below:
from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)                      # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path
Which code change can be implemented so that the resulting DataFrame meets the data scientist's requirements?
Options
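For reference, a hedged sketch of the whole-file reading pattern the question is pointing toward, using the wholetext option together with input_file_name():

from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = (spark.read.option("wholetext", True).text(txt_path)   # one row per file
      .withColumn("file_path", input_file_name()))          # full path of the source file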
Q: 22
Which components of Apache Spark's architecture are responsible for carrying out tasks assigned to them?
Options
Q: 23
A Spark application needs to read multiple Parquet files from a directory where the files have
differing but compatible schemas.
The data engineer wants to create a DataFrame that includes all columns from all files.
Which code should the data engineer use to read the Parquet files and include all columns using
Apache Spark?
Options
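For context, a hedged sketch of reading Parquet files with differing but compatible schemas so that the resulting DataFrame carries the union of all columns; the directory path is a placeholder:

df = (spark.read
      .option("mergeSchema", "true")     # merge the schemas of all files in the directory
      .parquet("/data/events_parquet/"))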
Q: 24
An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:
from pyspark.sql.functions import broadcast
result = df2.join(broadcast(df1), on='id', how='inner')
What is the purpose of using broadcast() in this scenario?
Options
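As background, the broadcast() hint ships the small DataFrame whole to every executor so the large side can be joined without shuffling it; the related size threshold for automatic broadcasting is sketched below (the value shown is the usual default):

# Joins below this size are broadcast automatically; broadcast(df1) forces it regardless of size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))  # 10 MB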
Q: 25
What is the benefit of Adaptive Query Execution (AQE)?
Options