Free Practice Test

Free Certified Associate Developer Practice Exam – 2025 Updated

Prepare Smarter for the Certified Associate Developer Exam with Our Free and Trusted Certified Associate Developer Exam Questions – 2025 Updated.

At Cert Empire, we are dedicated to providing the latest and most accurate exam questions for students preparing for the Databricks Certified Associate Developer Exam. To support better preparation, we’ve made parts of our Certified Associate Developer exam resources free for everyone. You can practice as much as you want with the free Certified Associate Developer practice test.

Databricks Certified Associate Developer for Apache Spark Free Exam Questions

Disclaimer

Please note that the demo questions are not updated frequently, and you may also find them in open communities around the web. This demo is only intended to show the sort of questions you will find in our original files.

The premium exam dump files, however, are updated frequently and are based on the latest exam syllabus and real exam questions.

1 / 60

Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?
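For reference, a minimal sketch of one way to append the rows by position in PySpark (union matches columns by position and ignores column names; this is not presented as the official answer key):
combinedDf = todayTransactionsDf.union(yesterdayTransactionsDf)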

2 / 60

Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
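A minimal sketch of one way to concatenate and then de-duplicate (not the official answer key):
combinedDf = transactionsDf.union(transactionsNewDf).distinct()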

3 / 60

The code block shown below should return an exact copy of DataFrame transactionsDf that does not include rows in which values in column storeId have the value 25. Choose the answer that correctly fills the blanks in the code block to accomplish this.
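The blanked code block itself is not reproduced in this demo. As a sketch only, the described behavior can be expressed as:
from pyspark.sql.functions import col
transactionsDf.filter(col("storeId") != 25)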

4 / 60

Which of the following statements about stages is correct?

5 / 60

The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error.
Code block:
transactionsDf.write.partitionOn("storeId").parquet(filePath)
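For reference, a corrected sketch, assuming the error is the writer method name (DataFrameWriter exposes partitionBy, not partitionOn):
transactionsDf.write.partitionBy("storeId").parquet(filePath)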

6 / 60

Which of the following describes properties of a shuffle?

7 / 60

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

8 / 60

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?
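A minimal sketch of this caching behavior, assuming the MEMORY_AND_DISK storage level is what is being described:
from pyspark import StorageLevel
itemsDf.persist(StorageLevel.MEMORY_AND_DISK)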

9 / 60

Which of the following code blocks generally causes a great amount of network traffic?

10 / 60

Which of the following describes a narrow transformation?

11 / 60

Which of the following statements about reducing out-of-memory errors is incorrect?

12 / 60

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green,
respectively. Find the error.
Code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], "color")
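A sketch of one possible corrected version, assuming the intended fix is passing the column name as a list rather than a bare string:
spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])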

13 / 60

Which of the following statements about the differences between actions and transformations is correct?

14 / 60

Which of the following code blocks returns a DataFrame containing a column dayOfYear, an integer representation of the day of the year from column openDate from DataFrame storesDF?
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
A sample of storesDF is displayed below:

(Image: sample rows of storesDF; not reproduced here.)

15 / 60

Which of the following Spark properties is used to configure whether DataFrame partitions that do not meet a minimum size threshold are automatically coalesced into larger partitions during a shuffle?

16 / 60

The code block shown below contains an error. The code block is intended to return a new 12-partition DataFrame from the 8-partition DataFrame storesDF by inducing a shuffle. Identify the error.
Code block:
storesDF.coalesce(12)
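For reference, a sketch assuming the error is that coalesce() cannot increase the partition count; repartition() induces a shuffle and can:
storesDF.repartition(12)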

17 / 60

Which of the following operations can be used to return a new DataFrame from DataFrame storesDF without inducing a shuffle?

18 / 60

The code block shown below contains an error. The code block is intended to create a Python UDF assessPerformanceUDF() using the integer-returning Python function assessPerformance() and apply it to column customerSatisfaction in DataFrame storesDF. Identify the error.
Code block:
assessPerformanceUDF – udf(assessPerformance)
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
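A sketch of one possible corrected version, assuming the intended fixes are assigning the UDF with = and declaring the integer return type (assessPerformance is the Python function named in the question):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

assessPerformanceUDF = udf(assessPerformance, IntegerType())
storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))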

19 / 60

The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error.
Code block:
storesDF.printSchema
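A sketch assuming the error is the missing parentheses (printSchema is a method call):
storesDF.printSchema()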

20 / 60

Which of the following code blocks returns a 15 percent sample of rows from DataFrame storesDF without replacement?
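A minimal sketch of one way to draw such a sample (not the official answer key):
storesDF.sample(withReplacement=False, fraction=0.15)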

21 / 60

The code block shown below contains an error. The code block is intended to return a new DataFrame with the mean of column sqft from DataFrame storesDF in column sqftMean. Identify the error.
Code block:
storesDF.agg(mean("sqft").alias("sqftMean"))
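For reference only, assuming the intended error is that mean is never imported, an otherwise identical block would first need:
from pyspark.sql.functions import mean

storesDF.agg(mean("sqft").alias("sqftMean"))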

22 / 60

Which of the following operations returns a GroupedData object?

23 / 60

The code block shown contains an error. The code block is intended to return a new DataFrame where column sqft from DataFrame storesDF has had its missing values replaced with the value 30,000. Identify the error.
A sample of DataFrame storesDF is displayed below:

(Image: sample rows of storesDF; not reproduced here.)

Code block:
storesDF.na.fill(30000, col("sqft"))
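A sketch of one possible corrected version, assuming the error is that na.fill() expects column names rather than a Column object:
storesDF.na.fill(30000, ["sqft"])
# equivalently: storesDF.na.fill({"sqft": 30000})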

24 / 60

Which of the following code blocks returns a new DataFrame with column storeDescription where the pattern "Description: " has been removed from the beginning of column storeDescription in DataFrame storesDF?
A sample of DataFrame storesDF is below:

(Image: sample rows of storesDF; not reproduced here.)

25 / 60

Which of the following code blocks returns a DataFrame where column storeCategory from DataFrame storesDF is split at the underscore character into column storeValueCategory and column storeSizeCategory?
A sample of DataFrame storesDF is displayed below:

(Image: sample rows of storesDF; not reproduced here.)

26 / 60

Which of the following code blocks returns a new DataFrame from DataFrame storesDF where column storeId is of the type string?

27 / 60

Which of the following operations can be used to create a DataFrame with a subset of columns from DataFrame storesDF that are specified by name?

28 / 60

Which of the following statements about Spark DataFrames is incorrect?

29 / 60

Which of the following object types cannot be contained within a column of a Spark DataFrame?

30 / 60

A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?

31 / 60

Which of the following cluster configurations is most likely to experience an out-of-memory error in response to data skew in a single partition?

(Image: table of candidate cluster configurations; not reproduced here.)

Note: each configuration has roughly the same compute power using 100 GB of RAM and 200 cores.

32 / 60

Which of the following statements about Spark’s stability is incorrect?

33 / 60

Which of the following DataFrame operations is classified as an action?

34 / 60

Which of the following is the most complete description of lazy evaluation?

35 / 60

Which of the following operations is most likely to result in a shuffle?

36 / 60

Which of the following describes the relationship between nodes and executors?

37 / 60

Which of the following is the most granular level of the Spark execution hierarchy?

38 / 60

The code block shown below contains an error. The code block is intended to return a DataFrame containing a column openDateString, a string representation of column openDate formatted using Java’s SimpleDateFormat. Identify the error.
Note that column openDate is of type integer and represents a date in the UNIX epoch format – the number of seconds since midnight on January 1st, 1970.
An example of Java’s SimpleDateFormat is "Sunday, Dec 4, 2008 1:05 PM".
A sample of storesDF is displayed below:

(Image: sample rows of storesDF; not reproduced here.)

Code block:
storesDF.withColumn("openDateString", from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a", TimestampType()))
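A sketch of one possible corrected version, assuming the error is the extra type argument (from_unixtime accepts only a column and a format string, and already returns a string):
from pyspark.sql.functions import from_unixtime, col

storesDF.withColumn("openDateString", from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a"))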

39 / 60

The code block shown below contains an error. The code block is intended to cache DataFrame storesDF only in Spark’s memory and then return the number of rows in the cached DataFrame. Identify the error.
Code block:
storesDF.cache().count()
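A sketch assuming the intended error is that cache() uses the MEMORY_AND_DISK storage level by default for DataFrames; caching only in memory could be written as:
from pyspark import StorageLevel

storesDF.persist(StorageLevel.MEMORY_ONLY).count()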

40 / 60

The code block shown below contains an error. The code block is intended to use SQL to return a new DataFrame containing column storeId and column managerName from a table created from DataFrame storesDF. Identify the error.
Code block:
storesDF.createOrReplaceTempView("stores")
storesDF.sql("SELECT storeId, managerName FROM stores")
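A sketch of one possible corrected version, assuming the error is that sql() belongs to the SparkSession, not the DataFrame:
storesDF.createOrReplaceTempView("stores")
spark.sql("SELECT storeId, managerName FROM stores")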

41 / 60

Which of the following code blocks fails to return a DataFrame reverse sorted alphabetically based on column division?

42 / 60

Which of the following code blocks returns all the rows from DataFrame storesDF?

43 / 60

Which of the following code blocks applies the function assessPerformance() to each row of DataFrame storesDF?

44 / 60

Which of the following code blocks returns a collection of summary statistics for all columns in DataFrame storesDF?

45 / 60

Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?
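A minimal sketch of one fast approximation (not the official answer key):
from pyspark.sql.functions import approx_count_distinct

storesDF.agg(approx_count_distinct("division")).collect()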

46 / 60

Which of the following operations can be used to return the number of rows in a DataFrame?

47 / 60

Which of the following code blocks returns a new DataFrame where column productCategories only has one word per row, resulting in a DataFrame with many more rows than DataFrame storesDF?
A sample of storesDF is displayed below:

(Image: sample rows of storesDF; not reproduced here.)

48 / 60

Which of the following code blocks returns a new DataFrame where column division from DataFrame storesDF has been replaced and renamed to column state and column managerName from DataFrame storesDF has been replaced and renamed to column managerFullName?

49 / 60

Which of the following operations fails to return a DataFrame with no duplicate rows?

50 / 60

Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?
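A minimal sketch of one way to express this filter (not the official answer key):
from pyspark.sql.functions import col

storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))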

51 / 60

Which of the following code blocks returns a new DataFrame with a new column employeesPerSqft that is the quotient of column numberOfEmployees and column sqft, both of which are from DataFrame storesDF? Note that column employeesPerSqft is not in the original DataFrame storesDF.

52 / 60

Which of the following operations can be used to create a new DataFrame that has 12 partitions from an original DataFrame df that has 8 partitions?

53 / 60

The code block shown below contains an error. The code block is intended to return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Identify the error.
Code block:
storesDF.drop(sqft, customerSatisfaction)
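A sketch of one possible corrected version, assuming the error is that the column names must be passed as strings (bare sqft and customerSatisfaction are undefined names here):
storesDF.drop("sqft", "customerSatisfaction")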

54 / 60

Which of the following describes the difference between cluster and client execution modes?

55 / 60

Of the following situations, in which will it be most advantageous to store DataFrame df at the MEMORY_AND_DISK storage level rather than the MEMORY_ONLY storage level?

56 / 60

The default value of spark.sql.shuffle.partitions is 200. Which of the following describes what that means?

57 / 60

Which of the following DataFrame operations is classified as a wide transformation?

58 / 60

Which of the following describes the Spark driver?

59 / 60

Which of the following will occur if there are more slots than there are tasks?

60 / 60

Which of the following statements about Spark jobs is incorrect?
