Question 20 - Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Real Exam Questions [Feb 2026 Update]

Q: 20

Which UDF implementation calculates the length of strings in a Spark DataFrame?

Options

Correct Answer:

Explanation

While the question asks for a "UDF implementation," the provided correct answer, Option B, utilizes the built-in, highly-optimized length function from pyspark.sql.functions. In a certification context, this is considered the correct approach because a core principle of Spark development is to always prefer native functions over User-Defined Functions (UDFs) for performance. Python UDFs introduce significant overhead due to data serialization/deserialization between the JVM and a Python process. Therefore, the most appropriate and performant implementation to calculate string length is the native length function, making Option B the best answer.

Why Incorrect

A. This option has incorrect syntax. The function to create a UDF for DataFrame operations is imported from pyspark.sql.functions, not called via spark.udf. It also specifies an incorrect return type (StringType).

C. This correctly registers a UDF named stringLength for use in Spark SQL queries (e.g., SELECT stringLength(col) ...), but it does not show the function being applied to a DataFrame column using the DataFrame API.

D. This option demonstrates the correct pattern for applying a UDF to a DataFrame column but specifies the wrong return type. The length of a string is an integer, so the return type should be IntegerType(), not StringType().

References

1. Apache Spark 3.5.1 Documentation

pyspark.sql.functions.length: This official documentation confirms that length is a built-in function that "computes the character length of string data". This supports Option B as the correct native implementation.

Source: spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.length.html

2. Databricks Documentation

"User-defined functions (UDFs)": This documentation explicitly warns about the performance cost of UDFs: "UDFs are a black box for the optimizer... Because the optimizer does not have visibility into the logic of the UDF

it cannot use any of its strategies to optimize the computation... It is recommended to use the built-in functions... before falling back to UDFs." This justifies choosing the built-in function (Option B) over a UDF implementation.

Source: Databricks Documentation > Apache Spark > Development > User-defined functions (UDFs)

3. Apache Spark 3.5.1 Documentation

pyspark.sql.functions.udf: This document details the creation of a UDF for DataFrame use. It shows that the second argument is the returnType

confirming that StringType() in Option D is incorrect for a function that returns an integer length.

Source: spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html

4. Apache Spark 3.5.1 Documentation

pyspark.sql.UDFRegistration.register: This page shows the correct syntax for registering a UDF for use in the SQL namespace

as demonstrated in Option C. This confirms that Option C is a valid way to define a UDF for SQL

but not to apply it in the DataFrame API as shown in the other options.

Source: spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.UDFRegistration.register.html

Premium Access Includes

FLASH OFFER

avail 10% DISCOUNT on YOUR PURCHASE