An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline. The initial code is:

    def in_spanish_inner(df: pd.Series) -> pd.Series:
        model = get_translation_model(target_lang='es')
        return df.apply(model)

    in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF

Question 3 - Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Real Exam Questions [Feb 2026 Update]

Q: 3


Options

Correct Answer:

Explanation

The Iterator[Series] -> Iterator[Series] Pandas UDF is the ideal pattern for this scenario. Spark calls this type of UDF once per data partition rather than once per batch. The function receives an iterator of pandas Series, so the expensive model (get_translation_model) can be loaded just once at the beginning of the function. The code then iterates through the batches within the partition, applying the already-loaded model to each batch. This avoids the overhead of re-initializing the model for every batch, which is what happens with a standard Series -> Series UDF.
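The pattern described above can be sketched as follows. This is a minimal illustration, not the exam's reference solution: get_translation_model is replaced by a hypothetical stand-in that records each load in MODEL_LOADS (both are assumptions made here so the load count can be observed), and the Spark registration is wrapped in a hypothetical helper, build_in_spanish_udf, so the batch logic can be exercised with pandas alone.

```python
from typing import Callable, Iterator

import pandas as pd

# Records every model load so the "load once" behaviour is observable.
MODEL_LOADS = []


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Hypothetical stand-in for the expensive model loader in the question."""
    MODEL_LOADS.append(target_lang)
    vocabulary = {"hello": "hola", "world": "mundo"}  # toy data, not a real model
    return lambda text: vocabulary.get(text, text)


def in_spanish_inner(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Runs once per invocation, i.e. once per partition when used as a UDF.
    model = get_translation_model(target_lang="es")
    for batch in batches:
        # Reuse the already-loaded model for every batch in the partition.
        yield batch.apply(model)


def build_in_spanish_udf():
    """Register the function as a Pandas UDF (requires pyspark and pyarrow)."""
    from pyspark.sql import functions as sf
    from pyspark.sql.types import StringType

    # The Iterator[pd.Series] type hints tell Spark to use the iterator variant.
    return sf.pandas_udf(in_spanish_inner, StringType())
```

Feeding in_spanish_inner an iterator of several pandas Series triggers a single model load, whereas the original Series -> Series function would load the model once per Series.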

Why Incorrect

A. Converting to a standard PySpark UDF would be less performant, as it operates row-by-row, potentially loading the model for every single row instead of for each batch.

B. A Series -> Scalar UDF is an aggregation that returns a single value per group, so it cannot return one translated string per input row, and the model would still be reloaded on every call.

C. While mapInPandas also uses an iterator pattern, the most direct and idiomatic way to modify an existing Pandas UDF for this optimization is to change its type to Iterator[Series] -> Iterator[Series].
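For comparison, the mapInPandas alternative mentioned in option C might look roughly like the sketch below. It works on an iterator of pandas DataFrames rather than Series, so it requires restructuring the pipeline around whole DataFrames instead of a column expression. The column names text and text_es, the function name translate_partition, and the toy get_translation_model stand-in are all assumptions made for illustration.

```python
from typing import Callable, Iterator

import pandas as pd

# Records every model load so the load count can be inspected.
LOADS = []


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Hypothetical stand-in for the expensive model loader in the question."""
    LOADS.append(target_lang)
    vocabulary = {"hello": "hola", "cat": "gato"}  # toy data, not a real model
    return lambda text: vocabulary.get(text, text)


def translate_partition(frames: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Loaded once per invocation; mapInPandas invokes the function per partition.
    model = get_translation_model(target_lang="es")
    for frame in frames:
        # Add the translated column next to the assumed input column `text`.
        yield frame.assign(text_es=frame["text"].apply(model))


# With a Spark DataFrame `df` that has a string column `text` (assumed):
# translated = df.mapInPandas(translate_partition, schema="text string, text_es string")
```

The model is again loaded once per partition, but the caller has to supply the full output schema, which is why changing the existing Pandas UDF's type is the smaller edit.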

References

1. Apache Spark Documentation

Pandas UDFs (a.k.a. Vectorized UDFs): The section on "Iterator of Series to Iterator of Series UDF" explicitly states: "It is useful when the UDF execution requires an expensive initialization... The example below shows how to use an iterator of series UDF to load a large model once and then apply it to each series in the partition." This directly addresses the question's scenario.

Source: Apache Spark 3.5.0 Documentation

pyspark.sql.functions.pandas_udf

Section: "Iterator of Series to Iterator of Series UDF".

2. Databricks Documentation

User-defined functions - Python: The documentation describes the Iterator[pd.Series] -> Iterator[pd.Series] UDF and highlights its primary use case: "This is useful when the UDF execution requires an expensive state to be initialized... The state can be created once and used for the entire partition."

Source: Databricks Documentation

"User-defined functions - Python"

Section: "Pandas UDFs (vectorized UDFs)"

Subsection: "Iterator of Series to Iterator of Series".
