A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Machine-Learning-Associate/page_18_img_1.jpg Which of the following changes do they need to make to the above code block in order to accomplish the task?

Change SparkTrials() to Trials()

Reduce num_evals to be less than 10

Remove the trials=trials argument

Remove the algo=tpe.suggest argument

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Machine-Learning-Associate/page_19_img_1.jpg The machine learning engineer shares the following code block: Which of the following changes does the machine learning engineer need to make to complete the task?

They need to convert the features column to be a vector

They need to call the transform method on train_df

They do not need to make any changes

They need to utilize a Pipeline to fit the model

They need to split the features column out into one column for each feature

A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block: https://kxbjsyuhceggsyvxdkof.supabase.co/storage/v1/object/public/file-images/Databricks-Machine-Learning-Associate/page_16_img_1.jpg Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?

Utilize the Pipeline API to standardize the test data according to the training data's summary statistics

Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values

Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values

Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

Utilize the Pipeline API to standardize the training data according to the test data's summary statistics

Free Machine Learning Associate Practice Exam

Databricks Machine Learning Associate

Q: 1

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion?

Options

Correct Answer:

Explanation

A feature repository, or feature store, is designed to create and manage features for reuse across multiple machine learning models and teams. The choice of feature encoding can be highly dependent on the specific machine learning algorithm being used. One-hot encoding (OHE) creates a high-dimensional, sparse representation of categorical data. While this is necessary for algorithms like linear regression, it can be suboptimal or even detrimental for others, such as tree-based models (e.g., Random Forest, Gradient Boosting), which can handle categorical features more naturally. Storing the raw categorical feature in the repository provides maximum flexibility, allowing each downstream modeling pipeline to apply the most appropriate encoding strategy for its specific algorithm.
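The flexibility argument above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `raw_room_type` data and the helpers `fit_vocabulary` and `one_hot` are hypothetical, not part of any feature store API. It shows the same raw categorical column being encoded two different ways by two downstream pipelines.

```python
def fit_vocabulary(values):
    """Learn the category vocabulary from the training data only."""
    return sorted(set(values))


def one_hot(value, vocabulary):
    """Encode one raw category as a 0/1 list; unseen categories map to all zeros."""
    return [1 if value == category else 0 for category in vocabulary]


# Raw categorical feature as it might be stored in the repository (hypothetical data).
raw_room_type = ["suite", "single", "double", "single"]

# A linear-model pipeline applies one-hot encoding at training time.
vocabulary = fit_vocabulary(raw_room_type)
encoded = [one_hot(value, vocabulary) for value in raw_room_type]

# A tree-based pipeline could instead map the same raw values to integer indices.
index_of = {category: i for i, category in enumerate(vocabulary)}
indexed = [index_of[value] for value in raw_room_type]
```

Had the repository stored only the one-hot columns, the second pipeline would have had to reverse the encoding before it could use the feature.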

Why Incorrect

A. This is factually incorrect. One-hot encoding is a standard, widely supported preprocessing step in all major machine learning libraries, including Scikit-learn and Spark MLlib.

B. This describes target (or mean) encoding. One-hot encoding is an unsupervised transformation that depends only on the values within the feature itself, not the target variable.

C. While OHE can be computationally intensive for high-cardinality features, the primary justification for avoiding it in a feature store is its model-specific nature, not just its computational cost.

D. This is factually incorrect. One-hot encoding is one of the most common and fundamental strategies for numerically representing categorical features for many model types.

References

1. Official Vendor Documentation (Databricks): The Databricks documentation on "Feature engineering in Unity Catalog" emphasizes the principle of reusability. It states, "The feature store provides a centralized repository that enables discovery and reuse of features." Pre-applying a model-specific transformation like OHE would limit this reusability, as not all models benefit from it. This aligns with the principle of keeping features in a more general state within the store.

Source: Databricks Documentation, "Feature engineering in Unity Catalog > What is a feature store?".

2. Academic Publication: In the textbook The Elements of Statistical Learning, the authors explain how different model families handle categorical predictors. Linear models require dummy variables (one-hot encoding), while tree-based models can handle them natively. This highlights that the encoding strategy is algorithm-dependent.

Source: Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. Chapter 9, Section 9.2.4, "Other Issues," discusses how tree-based models can handle categorical predictors without creating dummy variables.

3. University Courseware: Stanford's course on Machine Learning Systems Design discusses the architecture of feature stores. A key principle is to decouple feature generation from model-specific transformations. The feature store should provide consistent, raw (or lightly processed) features, while model-specific steps like OHE (whose parameters depend on the training data's vocabulary) should be part of the model training pipeline to ensure consistency and appropriateness for the chosen algorithm.

Source: Stanford University, CS 329S: Machine Learning Systems Design, Winter 2021, Lecture 4: "Data Engineering."

Q: 2

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

● Hyperparameter 1: [2, 5, 10]
● Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

Options

Correct Answer:

Explanation

The total number of models to be trained is determined by multiplying the number of hyperparameter combinations by the number of cross-validation folds.

First, calculate the number of hyperparameter combinations from the grid search:

(Values for Hyperparameter 1) × (Values for Hyperparameter 2) = 3 × 2 = 6 combinations.

Next, for each of these 6 combinations, 3-fold cross-validation is performed. This requires training a separate model for each fold.

Total models = 6 combinations × 3 folds = 18 models.

Since each of these 18 model training runs is an independent task, a distributed computing platform like Databricks can execute all of them in parallel, assuming sufficient cluster resources are available.
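The counting argument above can be checked by enumerating the fits directly. The sketch below uses only the standard library; the variable names are illustrative, and the grid values are the ones given in the question.

```python
from itertools import product

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
num_folds = 3

# Every pairing of the two hyperparameter lists is one grid point.
combinations = list(product(hyperparameter_1, hyperparameter_2))

# Each grid point is fitted once per fold, and each fit is an independent task.
fits = [(h1, h2, fold) for (h1, h2) in combinations for fold in range(num_folds)]

print(len(combinations), len(fits))  # 6 18
```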

Why Incorrect

A. 3: This represents only the number of cross-validation folds, not the total number of models trained across all hyperparameter combinations.

B. 5: This is the sum of the number of hyperparameter values (3 + 2), which is an incorrect calculation for a grid search.

C. 6: This correctly identifies the number of hyperparameter combinations (3 × 2) but omits the 3 models trained for each combination due to 3-fold cross-validation.

References

1. Apache Spark Official Documentation, pyspark.ml.tuning.CrossValidator: The documentation for Spark's CrossValidator describes the process: "For each paramMap, CrossValidator will split the dataset into k folds. Then it will train on k-1 folds and evaluate on the remaining fold." This confirms that for each hyperparameter combination (a paramMap), k models are trained (one for each fold). The parallelism parameter further confirms that these model fits can be executed in parallel. In this scenario, there are 6 paramMaps and k=3, resulting in 18 total model fits that can be parallelized.

Source: Apache Spark 3.5.0 Documentation, MLlib: Main Guide > ML Tuning: model selection and hyperparameter tuning > Cross-Validation.

2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. In Chapter 7, Section 7.10.1, "Cross-Validation," the authors describe K-fold cross-validation. The process involves fitting the model K times on different subsets of the training data. When combined with a grid search, this fitting process is repeated for every point in the hyperparameter grid. The independence of each model fit makes the overall process highly parallelizable.

Source: Chapter 7, "Model Assessment and Selection," Section 7.10.1, page 242.

3. Databricks Machine Learning Documentation, "Hyperparameter tuning": The documentation explains how tools like Hyperopt with SparkTrials can "distribute runs and manage models" for hyperparameter tuning. This distribution of runs across a cluster's worker nodes is the mechanism that enables the parallel training of the multiple models generated by a grid search and cross-validation process.

Source: Databricks Documentation > Machine Learning > Models > Hyperparameter tuning.

Q: 3

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Options

Correct Answer:

Explanation

Spark ML (MLlib) is Apache Spark's scalable machine learning library. It is designed to operate natively on distributed Spark DataFrames. It provides a comprehensive set of feature engineering tools, such as VectorAssembler, StandardScaler, and OneHotEncoder, which are implemented as distributed algorithms. These tools process data in parallel across a cluster without requiring the user to write custom logic in a User-Defined Function (UDF) or use the pandas Function API. This makes Spark ML the ideal choice for performing large-scale, distributed feature engineering directly within a Spark-based machine learning pipeline.

Why Incorrect

A. Keras is a deep learning framework, not a general-purpose tool for distributed feature engineering on tabular data.

B. pandas is a single-node library and cannot perform distributed computation on its own.

C. PyTorch is a deep learning framework focused on tensor computation, not a native distributed feature engineering library for DataFrames.

E. Scikit-learn is a single-node library; using it at a distributed scale would require wrappers like pandas UDFs, which the question explicitly excludes.

References

1. Databricks Official Documentation, "Feature engineering and featurization": This document states, "You can use Spark ML to scale feature engineering... Spark ML provides a broad set of feature transformers that you can apply to your data." It then lists numerous built-in transformers like StandardScaler and VectorAssembler that operate on Spark DataFrames. (See section: "Feature engineering with MLlib").

2. Databricks Official Documentation, "Single-node and distributed training": This page contrasts single-node libraries with distributed ones. It clarifies, "For big data, you can use Spark’s machine learning library, MLlib... MLlib is a distributed library and is the only ML library that comes pre-installed on Databricks Runtimes." This highlights that Spark ML is the native distributed tool, unlike Scikit-learn. (See section: "Distributed training").

3. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc. Chapter 6, "Machine Learning with MLlib," describes the library's architecture, stating, "MLlib is Spark’s library of machine learning algorithms... The feature extraction and transformation utilities in MLlib help you create features from your raw data." The chapter details transformers that are inherently distributed. (See Chapter 6, "Feature Extraction and Transformation").

Q: 4

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block:

[Image: code block]

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options

Correct Answer:

Explanation

The provided code uses SparkTrials to parallelize the hyperparameter tuning process. SparkTrials is designed to distribute trials across a Spark cluster, where each trial runs as a separate Spark task. This is ideal for tuning single-node models (e.g., scikit-learn).

However, the question states the objective function wraps a Spark ML model. Spark ML models are inherently distributed and launch their own Spark jobs for training. Using SparkTrials with a Spark ML model would attempt to launch a Spark job from within another Spark task, a practice known as nested parallelism, which is not supported in Spark and can lead to deadlocks or failures.

The correct pattern for tuning Spark ML models is to run the tuning process on the driver node using the standard Trials() class. The driver will iterate through hyperparameter sets, and for each set, the Spark ML model training will be launched as a regular Spark job, utilizing the cluster as intended.

Why Incorrect

B. Reducing num_evals only changes the number of tuning runs; it does not fix the fundamental architectural conflict of nested Spark jobs.

C. Changing fmin() to fmax() modifies the optimization objective (minimization vs. maximization) but does not address the execution model issue.

D. Removing the trials=trials argument would cause fmin to use a default Trials object, but the explicit and correct change is to replace SparkTrials with Trials.

E. Removing the algo=tpe.suggest argument changes the search algorithm but does not resolve the unsupported use of SparkTrials with a Spark ML model.

References

1. Databricks Official Documentation, "Hyperparameter tuning with Hyperopt": In the section on best practices, the documentation explicitly states the correct usage pattern.

Reference: Databricks Machine Learning Guide > Hyperparameter tuning > Best practices for Hyperopt.

Quote: "To tune Spark MLlib models, use Hyperopt on the driver node only; that is, do not use SparkTrials. For Spark MLlib models, Hyperopt generates hyperparameters on the driver and then Spark MLlib’s model fitting and evaluation algorithms use Spark to distribute the computation." This directly supports changing SparkTrials() to Trials() (which runs on the driver).

Quote: "When using SparkTrials, do not pass algorithms that require SparkContext to the objective function. For example, do not use CrossValidator or TrainValidationSplit to evaluate Spark ML models. These methods require running nested Spark jobs, which is not supported." This explains the underlying technical reason why the original code is incorrect.

Q: 5

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult. Which of the following describes why?

Options

Correct Answer:

Explanation

Gradient Boosting is an ensemble learning technique that builds models in a sequential, stage-wise fashion. Each new model (typically a decision tree) is trained to correct the errors or residuals of the ensemble of all previously trained models. This means the construction of the second tree depends entirely on the outcome of the first tree, the third tree depends on the first two, and so on. This inherent sequential dependency, where each iteration requires the output from the previous one, is the fundamental reason why the overall tree-building process cannot be easily parallelized.
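The sequential dependency described above is visible in even a toy implementation. The following is a minimal sketch, assuming squared-error loss, one-split regression stumps as base learners, and a small made-up one-dimensional dataset; `fit_stump` and `mean_squared_error` are illustrative helpers, not library functions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by minimizing squared error."""
    best = None
    # Skip the largest x so both sides of every candidate split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
learning_rate = 0.5

predictions = [0.0] * len(xs)
errors = [mean_squared_error(ys, predictions)]
for _ in range(5):
    # Each stump is fitted to the residuals of the ensemble built so far,
    # so iteration m cannot start before iteration m-1 has finished.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    errors.append(mean_squared_error(ys, predictions))
```

The threshold search inside `fit_stump` has no such dependency between candidates, which is the kind of within-tree work that can be parallelized.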

Why Incorrect

A. Parallelization is a general computing concept not restricted to linear algebra; many non-linear algorithms can be parallelized.

B. Many parallel algorithms work by distributing data across nodes; needing access to all data does not inherently prevent parallelization.

C. The calculation of gradients across data points can often be parallelized; this is not the primary bottleneck for the overall algorithm.

References

1. Official Vendor Documentation (Apache Spark, the foundation of Databricks ML): In the MLlib programming guide for Gradient-Boosted Trees (GBTs), the documentation states: "GBTs train decision trees one by one, where each new tree helps to correct the errors of the previously trained ensemble of trees. The training of each tree is dependent on the previously trained trees." This highlights the sequential nature of the algorithm.

Source: Apache Spark 3.5.0 MLlib Guide, "Classification and regression - Gradient-boosted trees (GBTs)," Algorithm section.

2. Academic Publication: The canonical textbook "The Elements of Statistical Learning" describes the gradient boosting algorithm as a forward stagewise procedure. The algorithm is presented as a loop from m=1 to M, where each step m explicitly uses the model f_{m-1}(x) from the preceding step to compute the residuals and fit the next base learner h_m(x). This iterative dependency is fundamental to the method.

Source: Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. Chapter 10, "Boosting and Additive Trees," Algorithm 10.3, page 359.

3. Academic Publication: The paper introducing XGBoost, a highly optimized gradient boosting implementation, discusses this challenge. It explains that while the inter-tree creation is sequential, they achieve scalability by parallelizing the process within each tree's construction, specifically the split-finding part. This distinction confirms that the overall boosting process is iterative and not parallelizable across trees.

Source: Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Section 2.1, "Regularized Learning Objective." DOI: https://doi.org/10.1145/2939672.2939785

Q: 6

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data. Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?

Options

Correct Answer:

Explanation

The pandas API on Spark is specifically designed to provide a familiar, pandas-like interface that executes on a distributed Apache Spark cluster. This allows data scientists to scale their existing pandas workloads with minimal code changes. Often, the primary modification involves changing the import statement from import pandas as pd to import pyspark.pandas as ps. This approach directly addresses the requirement to spend the "least amount of time refactoring" because it leverages existing code and knowledge of the pandas API, unlike rewriting the logic using the PySpark, Scala, or SQL APIs, which have different syntax and execution paradigms.

Why Incorrect

A. This describes the goal of using Spark (parallel processing) but is not a specific, actionable refactoring approach or API.

B. Refactoring to the native PySpark DataFrame API would require a significant rewrite, as its syntax and methods differ from the pandas API.

C. This would require rewriting the entire notebook in a different programming language (Scala), representing the most time-consuming refactoring effort.

D. Translating programmatic pandas operations into declarative Spark SQL queries is a fundamentally different approach and requires a complete code rewrite.

References

1. Databricks Documentation, "What is pandas API on Spark?": "Pandas API on Spark makes data scientists productive with big data, by allowing them to use the pandas API they are familiar with to work with terabyte-scale datasets... with minimal code change."

Source: Databricks Documentation > Apache Spark > Pandas API on Spark > Overview.

2. Apache Spark Documentation, "Pandas API on Spark": "It provides pandas-equivalent APIs that work on Apache Spark. Pandas API on Spark is useful for users who are already familiar with pandas and want to leverage Spark for big data."

Source: Apache Spark™ 3.5.0 Documentation > PySpark > Pandas API on Spark.

3. Databricks Blog, "Koalas: Easy Transition from pandas to Apache Spark": "The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark... Data scientists can now make a seamless transition from a single machine to a distributed environment."

Source: Databricks Blog, January 24, 2020, "Koalas: Easy Transition from pandas to Apache Spark".

Q: 7

A data scientist is wanting to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature. Which of the following lines of code can the data scientist run to accomplish the task?

Options

Correct Answer:

Explanation

The summary() method on a PySpark DataFrame is specifically designed to compute a rich set of aggregate statistics. For numerical columns, it calculates the count, mean, standard deviation, minimum, maximum, and approximate quartiles (25%, 50%, 75%). The interquartile range (IQR) is the difference between the 75th and 25th percentiles, which are both provided by the output of the summary() function. Therefore, this single command directly provides all the information requested by the data scientist.
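The last step, deriving the IQR from the reported quartiles, is simple arithmetic. The sketch below uses Python's standard `statistics` module on a small made-up list as a stand-in for one numerical column; it is not Spark code, and Spark's quartiles are approximate, so values on a real DataFrame may differ slightly.

```python
import statistics

# Hypothetical values for one numerical feature.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# The three cut points play the role of the 25%, 50%, and 75% rows of a summary table.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The IQR is the 75th percentile minus the 25th percentile.
iqr = q3 - q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```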

Why Incorrect

B. spark_df.stats(): This is not a valid method for a Spark DataFrame. The correct attribute to access statistical functions is .stat, not .stats().

C. spark_df.describe().head(): The .describe() method computes count, mean, standard deviation, min, and max, but it does not compute the percentiles required to find the interquartile range (IQR).

D. spark_df.printSchema(): This method only displays the schema of the DataFrame (column names and data types) and does not compute any summary statistics.

E. spark_df.toPandas(): This action converts the Spark DataFrame into a pandas DataFrame. It does not compute the statistics itself; it is a data collection step that precedes any analysis.

References

1. Apache Spark 3.5.0 Documentation, pyspark.sql.DataFrame.summary: "Computes specified statistics for numeric and string columns... If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max." This confirms that summary() provides the necessary quartiles for IQR.

Source: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.summary.html

2. Apache Spark 3.5.0 Documentation, pyspark.sql.DataFrame.describe: "Computes basic statistics for numeric and string columns... For numeric columns, the result includes count, mean, stddev, min, max." This source verifies that .describe() lacks the required percentile information.

Source: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.describe.html

3. Databricks Documentation, DataFrames: "You can use describe to see summary statistics for a DataFrame... To see more statistics, including quartiles, use the summary method." This official Databricks source explicitly differentiates between the two methods and highlights summary() for calculating quartiles.

Source: Databricks Documentation > Get started > DataFrames > Python > Summarize and visualize data. (Specific page URLs change, but the content is consistently found in the introductory DataFrame tutorials).

Q: 8

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning. Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

Options

Correct Answer:

Explanation

Databricks AutoML is designed to automate the iterative and time-consuming tasks of the machine learning model development process. This includes data preprocessing, model training, hyperparameter tuning, and evaluation. The output of an AutoML experiment is a set of trained models ranked on a leaderboard, along with generated notebooks. While AutoML facilitates the next step by allowing one-click registration of the best model into the MLflow Model Registry, the actual deployment of that model—for example, creating a real-time serving endpoint or setting up a batch inference job—is a distinct, subsequent step in the MLOps lifecycle that must be performed by the user outside of the AutoML experiment itself.

Why Incorrect

A. Model tuning: This is incorrect. A core feature of Databricks AutoML is to automatically perform hyperparameter tuning across a range of algorithms to find the best-performing model.

B. Model evaluation: This is incorrect. AutoML automatically evaluates each model trial using relevant metrics (e.g., F1 score for classification, RMSE for regression) and presents the results in a leaderboard for comparison.

D. Exploratory data analysis: This is incorrect. As part of its output, Databricks AutoML generates a data exploration notebook which includes a statistical summary and visualizations of the training dataset, thus performing a form of automated EDA.

References

1. Databricks Official Documentation, "How Databricks AutoML works": This document outlines the automated steps performed by AutoML. The list includes: "Prepares the dataset for model training," "Iterates to train and tune multiple models," "Evaluates models," and "Provides a Python notebook with the source code... including a data exploration notebook." Model deployment is not listed as an automated step.

Source: Databricks Machine Learning Guide > AutoML > How Databricks AutoML works.

2. Databricks Official Documentation, "Model serving with Databricks": This documentation describes model deployment as a separate process that occurs after a model has been trained and registered in the Model Registry. It details the steps to create and manage serving endpoints, which is distinct from the AutoML experiment workflow.

Source: Databricks Machine Learning Guide > MLflow > Model serving with Databricks.

3. The Big Book of MLOps (Databricks Ebook), Chapter 2, "A Modern MLOps Architecture": This official resource presents a diagram of the MLOps lifecycle. The "Build & Train" stage, which includes AutoML, is shown as separate and preceding the "Model Deployment" and "Model Monitoring" stages. This clearly delineates deployment as an activity outside of the automated training experiment.

Source: Databricks Ebooks, "The Big Book of MLOps", Page 15, Figure 2-1.

Q: 9

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables. Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options

Correct Answer:

Explanation

For large-scale datasets, solving the linear regression problem with a direct analytical method like the normal equation (which involves matrix decomposition) is computationally infeasible. The X^T X matrix can become too large to fit in memory or too expensive to invert.

To overcome this, Spark ML's LinearRegression implementation utilizes iterative optimization algorithms. These methods, such as L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno), start with an initial guess for the model's weights and repeatedly update them in a direction that minimizes the loss function. The core computations, like calculating the gradient, can be efficiently distributed across the cluster, making this approach highly scalable for large data.
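The reason the gradient computation distributes well is that the gradient of a summed loss is the sum of per-partition gradients. The sketch below illustrates that idea with plain batch gradient descent in standard Python; it is not L-BFGS and not Spark's implementation, and the data, the two-partition split, and `partial_gradient` are assumptions made for illustration.

```python
def partial_gradient(partition, weight, bias):
    """Sum of squared-error gradients over one partition (the work one executor would do)."""
    grad_w = 0.0
    grad_b = 0.0
    for x, y in partition:
        error = (weight * x + bias) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    return grad_w, grad_b


# Hypothetical data generated from y = 2x + 1, split into two partitions.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
num_rows = sum(len(p) for p in partitions)

weight, bias = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    # Each partition's gradient is independent of the others,
    # so on a cluster this step could run in parallel.
    partials = [partial_gradient(p, weight, bias) for p in partitions]
    # The driver sums the partial gradients and takes one update step.
    grad_w = sum(g[0] for g in partials) / num_rows
    grad_b = sum(g[1] for g in partials) / num_rows
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b
```

Only the two summed gradient values travel back to the driver on each iteration, regardless of how many rows each partition holds.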

Why Incorrect

A. Logistic regression is a distinct algorithm used for classification tasks, not a method for training a linear regression model.

B. This is false. The primary purpose of Spark ML is to provide scalable, distributed implementations of machine learning algorithms, including linear regression.

D. The least-squares method describes the objective function (minimizing the sum of squared errors) that linear regression solves, not the computational strategy for distributed training.

E. Singular value decomposition is a matrix decomposition technique. The question's premise correctly states that such methods do not scale well for this problem.

References

1. Official Apache Spark Documentation: The documentation for LinearRegression in MLlib explicitly details the optimization algorithms used. It states, "The implementation is based on the MLlib LBFGS optimizer for L2-regularized linear regression... For unregularized linear regression, the implementation uses a wrapper for the NormalEquation and Cholesky solvers... The normal equation solver is limited to at most 4096 features." This confirms that for a large number of variables (beyond 4096), the iterative L-BFGS optimizer is the method employed.

Source: Apache Spark 3.5.0 MLlib Guide, Classification and regression > Linear methods > Linear regression > Mathematical detail.

2. Academic Publication: The foundational paper on Spark's machine learning library highlights the design choice of using iterative methods for scalability. "For many ML algorithms, we can express the optimization problem as a sum of loss terms... and solve it with gradient descent. The gradient can be computed on a cluster by summing gradients computed on subsets of the data in parallel... MLlib has a general-purpose gradient descent optimizer." L-BFGS is a more advanced quasi-Newton iterative optimization method built on these principles.

Source: Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Talwalkar, A. (2016). MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34), 1-7. (Section 3.1: Implementation).

3. Official Databricks Documentation: The Databricks documentation on Linear Regression also confirms the use of iterative optimization. "The training algorithm uses the L-BFGS optimizer." This directly points to an iterative optimization approach as the standard for their implementation.

Source: Databricks Documentation, Machine Learning Guide > MLflow > MLflow models > spark.mllib > Linear Regression.

Q: 10

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options

Correct Answer:

Explanation

The Databricks MLflow experiment UI is designed for traceability and reproducibility. When an MLflow run is initiated from a Databricks notebook, the UI automatically captures a link to that notebook. This link is displayed in the "Source" column on the experiment page. Clicking this link opens a snapshot of the notebook as it existed when the run was executed, allowing users to review the exact code that produced the logged metrics, parameters, and artifacts. This is a key feature for auditing and reproducing machine learning experiments.

Why Incorrect

A. The MLmodel artifact is a metadata file within the run's artifacts that defines the model's format and dependencies, not a link to the source notebook.

B. The "Models" link or column indicates a logged model artifact and, if registered, links to its page in the Model Registry, not the source code.

D. The "Start Time" link navigates to the detailed page for that specific run, which contains metrics and artifacts, but it is not the direct link to the source notebook from the experiment view.

---

References

1. Databricks Official Documentation, "Organize training runs with MLflow experiments":

Reference: In the section describing the experiment UI table, the documentation states for the "Source" column: "Name of the notebook that created the run... Click the link to view the source." This directly confirms that the "Source" link is the correct method.

Location: https://docs.databricks.com/en/mlflow/experiments.html#view-an-experiment

2. Databricks Official Documentation, "Tutorial: ML end-to-end on Databricks":

Reference: This tutorial guides users through the ML lifecycle on Databricks. In the section "View the experiment and run," it explicitly instructs the user: "To view the notebook version that created the run, click the link in the Source field." This provides a practical, documented example of the correct procedure.

Location: https://docs.databricks.com/en/machine-learning/end-to-end-example.html#view-the-experiment-and-run

Q: 11

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: (schema image not reproduced)

The machine learning engineer shares the following code block: Which of the following changes does the machine learning engineer need to make to complete the task?

Options

Correct Answer:

Explanation

Spark ML algorithms, including LinearRegression, require that the input features be consolidated into a single column containing vectors. The provided DataFrame schema shows individual columns for each potential feature (e.g., review_scores_rating, bedrooms, beds). Before training the model, these individual columns must be assembled into a single feature vector column, which is conventionally named "features". This is typically accomplished using the VectorAssembler transformer. The model's fit method is then called on this transformed DataFrame, which now includes the required vector column.
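The workflow described above can be sketched in PySpark as follows. This is a minimal sketch, not the code block from the question: the feature column names come from the explanation, the label column name "price" is an assumption, and fit_price_model is a hypothetical helper name.

```python
# Assumed column names; adjust them to the real schema of train_df.
FEATURE_COLS = ["review_scores_rating", "bedrooms", "beds"]
LABEL_COL = "price"


def fit_price_model(train_df):
    """Assemble the numeric columns into one vector column, then fit LinearRegression."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Combine the individual feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    train_vec_df = assembler.transform(train_df)

    # Fit on the DataFrame that now carries the vector column.
    lr = LinearRegression(featuresCol="features", labelCol=LABEL_COL)
    return lr.fit(train_vec_df)
```

On a cluster, calling fit_price_model(train_df) would return a fitted LinearRegressionModel. Wrapping the assembler and the regressor in a Pipeline is an equally valid arrangement, but as noted for option D it is not required.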

Why Incorrect

A. The transform method is used to apply a trained model to generate predictions on new data, not for training the model itself.

C. Changes are mandatory because the input DataFrame's format, with separate feature columns, is incompatible with the Spark ML LinearRegression estimator's requirements.

D. While using a Pipeline is a best practice for organizing ML workflows, it is not a strict necessity. The required transformations can be applied sequentially without a Pipeline.

F. This is the opposite of the required action. Spark ML requires combining individual feature columns into a single vector column, not splitting them apart.

---

References

1. Apache Spark Official Documentation, MLlib Programming Guide, "Feature Extractors, Transformers, and Selectors": This guide explicitly describes the VectorAssembler transformer. It states, "VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees." This directly supports the necessity of creating a feature vector.

Source: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler

2. Databricks Official Documentation, "Tutorial: ML end-to-end on Databricks": In the "Feature engineering" section of this official tutorial, the VectorAssembler is used to combine multiple feature columns into a single features column before the model is trained. This demonstrates the standard, required workflow.

Source: https://docs.databricks.com/en/machine-learning/end-to-end-example.html

Section: "Feature engineering".

3. Apache Spark Official Documentation, MLlib Programming Guide, "Main Concepts": The "Estimator" section explains that an Estimator abstracts the concept of a learning algorithm. Its fit() method accepts a DataFrame to train a model. All provided code examples for estimators like LinearRegression or LogisticRegression show them being trained on a DataFrame that has already been preprocessed to include a "features" vector column.

Source: https://spark.apache.org/docs/latest/ml-guide-main.html#estimators

Q: 12

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options

Correct Answer:

Explanation

A pandas API on Spark DataFrame is an abstraction layer built directly on top of a native Spark DataFrame. It acts as a wrapper, utilizing the underlying distributed Spark DataFrame for data storage and computation. To provide the familiar, pandas-like API and features (such as a specific index), it maintains additional metadata alongside the Spark DataFrame. This design allows data scientists to leverage the distributed power of Spark using the well-known pandas syntax, effectively scaling their single-node workflows to big data environments without a steep learning curve.

Why Incorrect

A. pandas API on Spark DataFrames are distributed, not single-node. Their primary purpose is to enable distributed computation using a pandas-like interface.

B. They are a wrapper API and are not inherently more performant than the native, highly optimized Spark DataFrame API.

D. Both are built on Spark's immutable data structures. The pandas API on Spark does not change this fundamental characteristic.

E. They are fundamentally related; a pandas API on Spark DataFrame cannot exist without an underlying Spark DataFrame.

References

1. Apache Spark Documentation, Pandas API on Spark, Internals: "Internally, pandas API on Spark DataFrames are composed of a Spark DataFrame and an 'internal frame'. The internal frame holds the information about index and column labels to map the pandas-like API to the Spark DataFrame." This directly supports that it is made up of a Spark DataFrame and additional metadata.

Source: Apache Spark 3.5.1 Documentation, Pandas API on Spark, Internals section.

2. Databricks Documentation, Pandas API on Spark: "The pandas API on Spark provides pandas-equivalent APIs that work on Apache Spark... You can create a pandas API on Spark DataFrame by calling pyspark.pandas.from_pandas or pyspark.pandas.read_csv. You can also convert to and from pandas API on Spark DataFrames and PySpark DataFrames..." This demonstrates the direct relationship and interoperability, refuting that they are unrelated (E) and confirming they are built upon Spark's foundation.

Source: Databricks Documentation > Develop on Databricks > Libraries and scripts > Pandas API on Spark.

3. Learning Spark, 2nd Edition (O'Reilly), Chapter 11: Pandas API on Spark: "The pandas API on Spark was created to provide a pandas-like API on top of Spark, so that data scientists can make an easy transition from a single-node machine to a distributed environment... Under the hood, every pandas API on Spark DataFrame is backed by a PySpark DataFrame."

Source: Chambers & Zaharia, M. (2020). Learning Spark, 2nd Edition. O'Reilly Media, Inc. Chapter 11, "What Is the pandas API on Spark?" section.

Q: 13

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model. Which of the following possible explanations for this difference is invalid?

Options

Correct Answer:

Explanation

The question asks for the invalid explanation for the observed difference in Root Mean Squared Error (RMSE) between the two models. RMSE is a standard, widely used, and fundamentally valid metric for evaluating the performance of regression models. It measures the square root of the average of squared differences between predicted and actual values, providing an estimate of the error magnitude in the units of the label. Therefore, the statement that RMSE is an invalid evaluation metric for regression problems is factually incorrect and constitutes an invalid explanation. The other options describe plausible scenarios, including a common error (B) where predictions on a transformed scale are not converted back to the original scale before evaluation.
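The scale mismatch described in option B can be illustrated with a small plain-Python sketch. The prices and predictions below are made-up numbers chosen for illustration; they are not taken from the question.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up actual prices and made-up outputs of a model trained on log(price).
actual_prices = [100.0, 200.0, 400.0]
log_predictions = [math.log(110.0), math.log(190.0), math.log(420.0)]

# Mistake: comparing log-scale predictions directly with the actual prices.
rmse_log_scale = rmse(log_predictions, actual_prices)

# Fix: exponentiate the predictions back to the price scale before evaluating.
price_predictions = [math.exp(p) for p in log_predictions]
rmse_price_scale = rmse(price_predictions, actual_prices)

print(f"RMSE without exp: {rmse_log_scale:.2f}")
print(f"RMSE with exp:    {rmse_price_scale:.2f}")
```

The predictions are identical in both calls; only the scale on which they are compared changes, which is why the first RMSE is far larger even though the model itself is reasonably accurate.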

Why Incorrect

A. This is a plausible scenario. The second model could be more accurate, but the RMSE was miscalculated (e.g., as described in B), leading to a misleadingly high value.

B. This is a very likely and valid explanation. Failing to apply the exponential function to the second model's log-scale predictions before comparing them to the original price would result in a massive, meaningless RMSE.

C. This describes an irrelevant action. There is no reason to take the log of the first model's predictions, so this does not explain the observed difference. However, it is not a fundamentally false statement about ML principles like option E.

D. This is a simple and valid possibility. The first model, which directly models price, might genuinely be a better fit for the data than the log-transformed model.

References

1. Databricks Official Documentation, pyspark.ml.evaluation.RegressionEvaluator: The official API documentation for PySpark's RegressionEvaluator class lists rmse (Root Mean Squared Error) as the default metric for evaluation. This confirms its validity and standard use within the Databricks ecosystem.

Source: Apache Spark 3.5.0 Documentation > PySpark > pyspark.ml > pyspark.ml.evaluation. Section: metricName.

2. Databricks Official Documentation, "Regression: predict house prices" Tutorial: This official Databricks tutorial builds a regression model and explicitly uses RMSE to evaluate its performance. The "Evaluate the model" section states, "First, we'll look at the root mean squared error (RMSE). This metric is the square root of the mean squared error. It is a common metric to evaluate regression models."

Source: Databricks Machine Learning Guide > Tutorials > Regression: predict house prices.

3. University Courseware, "An Introduction to Statistical Learning": This is a standard textbook in many university statistics and machine learning programs. In Chapter 2, "Statistical Learning," Section 2.1.5, "Measuring the Quality of Fit," the Mean Squared Error (MSE) is introduced as the primary method for assessing the accuracy of a regression model. RMSE is the square root of MSE and is used for the same purpose, often preferred because its units are the same as the response variable.

Source: James, Witten, Hastie, & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 30).

4. University Courseware, "An Introduction to Statistical Learning" on Transformations: In Chapter 3, "Linear Regression," Section 3.3.3, "Other Considerations in the Regression Model," the text discusses addressing non-linearity by transforming variables. A common approach is replacing the response Y with log(Y). When making predictions with such a model, one must remember to transform the prediction back to the original scale for interpretation and evaluation, which supports why option (B) is a valid potential explanation for error.

Source: James, Witten, Hastie, & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 93).

Q: 14

A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block: (code image not reproduced)

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?

Options

Correct Answer:

Explanation

The primary concern is data leakage, which occurs when information from the test set is used to train the model. Standardizing the entire dataset before splitting allows the scaler to learn the mean and standard deviation from the test data, which then influences the transformation of the training data. The correct procedure is to split the data first, then fit the scaler only on the training data. This fitted scaler, which now contains the summary statistics of the training data, is then used to transform both the training and the test sets. The Spark ML Pipeline API is designed to correctly sequence these operations, ensuring that estimators (like StandardScaler) are fitted only on the training data during the pipeline.fit() call, and the resulting transformer is then applied to any dataset, including the test set.
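The split-then-fit order can be shown with a plain-Python sketch of the principle. This is not Spark ML code: fit_scaler and transform are hypothetical helpers standing in for an estimator's fit step and the resulting transformer, and the data and the fixed split are made up for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn the scaling statistics (mean and sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def transform(values, mean, std):
    """Apply previously learned statistics to any dataset."""
    return [(v - mean) / std for v in values]


# Made-up feature values; the last two rows play the role of the test set.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]

# Step 1: split first (a fixed split here, purely for illustration).
train, test = data[:8], data[8:]

# Step 2: fit the scaler on the training rows only.
train_mean, train_std = fit_scaler(train)

# Step 3: reuse the training statistics for both sets.
train_scaled = transform(train, train_mean, train_std)
test_scaled = transform(test, train_mean, train_std)

# For contrast: statistics learned from all rows are influenced by the test rows.
leaky_mean, leaky_std = fit_scaler(data)
```

Because the test rows never enter fit_scaler, train_mean stays at 4.5, whereas leaky_mean is pulled upward by the two large test values. A Pipeline fitted only on the training DataFrame gives the same guarantee in Spark ML.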

Why Incorrect

A. Using global minimum and maximum values from the entire dataset is the definition of data leakage, which is the problem that needs to be solved.

B. This is also incorrect as it uses global statistics, leading to data leakage from the test set into the transformation process.

C. Cross-validation is a model validation technique, not a substitute for feature scaling. Scaling must still be performed correctly within each fold of the cross-validation process.

D. Standardizing the training data using statistics from the test data is a severe and fundamentally incorrect form of data leakage.

References

1. Databricks Official Documentation, "ML end-to-end example": This tutorial demonstrates the correct workflow. A Pipeline containing feature engineering stages is created. The documentation shows the pipeline being fitted only on the training data (pipeline_model = pipeline.fit(train_df)), and then this fitted model is used to transform the test data (pred_df = pipeline_model.transform(test_df)). This directly supports the methodology described in the correct answer. (See: Databricks Documentation -> Machine Learning -> Tutorials -> ML end-to-end example -> Section: "Create a machine learning model").

2. Apache Spark Official Documentation, "ML Pipelines": The documentation explains that when a Pipeline.fit() method is called, it calls fit() on each Estimator stage (like StandardScaler) in sequence. The resulting Transformer becomes part of the PipelineModel. This PipelineModel can then be used to transform() new data. This confirms that the statistics for scaling are learned only from the data passed to fit(), which should be the training set. (See: Apache Spark 3.5.0 Documentation -> MLlib: Machine Learning Library -> ML Pipelines -> Section: "How it works").

3. Hastie, Tibshirani, & Friedman, J. (2009). The Elements of Statistical Learning. Springer. In Chapter 7, Section 7.10.2, "The Wrong and Right Way to Do Cross-validation," the authors explicitly warn against this error. They state, "The test data should be strictly held out from all aspects of the model fitting, including feature scaling. Any preprocessing steps that use data, such as computing means and variances for standardization, must be learned from the training data only and then applied to the test data." This foundational principle of machine learning directly applies to a train-test split.

Q: 15

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description?

Options

Correct Answer:

Explanation

The Databricks Feature Store Client (fs) provides the get_table() method to retrieve a handle to a specific Feature Table. This method returns a Table object, which contains various metadata attributes associated with the table. The description attribute of this Table object holds the string that was provided as the description during the table's creation. Therefore, chaining .description to the fs.get_table("new_table") call directly accesses and returns this metadata string programmatically.
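The lookup described above can be sketched as a small helper. The function name get_feature_table_description and the stub client are hypothetical; the stub exists only so the sketch runs outside Databricks, and the real client setup appears in comments as an assumption about a typical workspace.

```python
from types import SimpleNamespace


def get_feature_table_description(fs, table_name):
    """Return the description string stored with a feature table."""
    # get_table returns a metadata object; its description attribute holds the
    # text supplied when the table was created.
    return fs.get_table(table_name).description


# On Databricks the client would typically be created like this (not run here):
#   from databricks.feature_store import FeatureStoreClient
#   fs = FeatureStoreClient()
#   print(get_feature_table_description(fs, "new_table"))

# Hypothetical stand-in client that mimics only get_table(...).description.
stub_fs = SimpleNamespace(
    get_table=lambda name: SimpleNamespace(name=name, description="Hotel features")
)
description = get_feature_table_description(stub_fs, "new_table")
print(description)
```

Passing the client in as an argument keeps the helper independent of how fs was constructed, which also makes it easy to exercise with a stub as shown.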

Why Incorrect

A. This is incorrect. The Feature Store Python API is designed to manage all aspects of a feature table programmatically, including retrieving its metadata.

B. The create_training_set() method is used to join features from feature tables with a label DataFrame to create a dataset for model training, not to retrieve metadata.

D. The load_df() method is called on a Table object to load the feature data itself into a Spark DataFrame, not to access its metadata description.

E. This line of code returns the entire Table object, not the specific description string. The question asks for the code that returns the description itself.

References

1. Databricks Official Documentation, "Manage feature tables": This document provides examples of interacting with feature tables. Under the section "Get a feature table", it explicitly shows the code feature_table = fs.get_table(table_name) followed by print(feature_table.description) to access the description.

Source: Databricks Machine Learning Guide > Feature Store > Manage feature tables.

2. Databricks Official Documentation, "Feature Store Python API": The API reference details the FeatureStoreClient.get_table method, which returns a Table object. The documentation for the Table class lists description as a public attribute.

Source: Databricks API reference > Machine Learning APIs > Feature Store Python API > databricks.feature_store.client.FeatureStoreClient.get_table and the Table object definition.
