Study Smarter for the Machine Learning Associate Exam with Our Free and Accurate Machine Learning Associate Exam Questions – Updated for 2025.
At Cert Empire, we are committed to providing reliable, up-to-date exam questions for students preparing for the Databricks Machine Learning Associate Exam. To help learners study more effectively, we’ve made sections of our Machine Learning Associate exam resources free for everyone. You can practice as much as you want with the free Machine Learning Associate practice test.
Databricks Machine Learning Associate
Q: 1
An organization is developing a feature repository and is electing to one-hot encode all categorical
feature variables. A data scientist suggests that the categorical feature variables should not be
one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
Options
Q: 2
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing
model hyperparameters via grid search for a classification problem:
● Hyperparameter 1: [2, 5, 10]
● Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in
parallel during this process?
Options
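The counting behind this kind of question can be sketched in a few lines. This is only an illustration of the arithmetic, using the grid values and fold count stated in the question; the dictionary keys are placeholder names, and how many fits actually run at once depends on the cluster and parallelism settings, which the question does not specify.

```python
from itertools import product

# Values taken from the question's hyperparameter grid; the key names are placeholders.
grid = {
    "hyperparameter_1": [2, 5, 10],
    "hyperparameter_2": [50, 100],
}
num_folds = 3

# Every combination of hyperparameter values is one candidate configuration.
combinations = list(product(*grid.values()))
num_combinations = len(combinations)

# Each candidate is fit once per fold, and these fits do not depend on each other.
total_fits = num_combinations * num_folds

print(num_combinations)  # 6
print(total_fits)        # 18
```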
Q: 3
Which of the following tools can be used to distribute large-scale feature engineering without the
use of a UDF or pandas Function API for machine learning pipelines?
Options
Q: 4
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have
wrapped a Spark ML model in the objective function objective_function and they have defined the
search space search_space.
As a result, they have the following code block:
Which of the following changes do they need to make to the above code block in order to accomplish
the task?
Options
Q: 5
A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the
training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.
Which of the following describes why?
Options
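To see where the difficulty comes from, here is a minimal sketch of gradient boosting for squared error on one-dimensional data. It is not Spark ML's implementation: the data, the learning rate, and the helper `fit_stump` are hypothetical and exist only for illustration. The point to notice is that each iteration needs the residuals left by the previous one.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by minimizing squared error."""
    best = None
    # Dropping the largest value keeps the right-hand side of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


# Hypothetical training data (not from the source).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
learning_rate = 0.5
num_trees = 5

predictions = [0.0] * len(xs)
errors = []
for _ in range(num_trees):
    # The targets for this tree are the residuals of the ensemble built so far,
    # so this iteration cannot start before the previous one has finished.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)
    predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
```

The search for the best split inside `fit_stump` could be parallelized, but the outer loop over trees is inherently ordered.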
Q: 6
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their
colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time
refactoring their notebook to scale with big data?
Options
Q: 7
A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data
scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile
range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
Options
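As a reminder of what the requested statistics are, the sketch below computes them for a single numeric column using only the Python standard library. It does not use Spark, and the sample values are hypothetical. The IQR is the difference between the 75th and 25th percentiles, so any summary that reports it has to include quartiles.

```python
import statistics

# Hypothetical values for one numeric column (not from the source).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Quartiles with linear interpolation that treats the data as the full population.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "stddev": statistics.stdev(values),  # sample standard deviation
    "min": min(values),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(values),
}

# The interquartile range is derived from the two outer quartiles.
iqr = summary["75%"] - summary["25%"]
```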
Q: 8
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine
Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML
experiment?
Options
Q: 9
The implementation of linear regression in Spark ML first attempts to solve the linear regression
problem using matrix decomposition, but this method does not scale well to large datasets with a
large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression
model for large data?
Options
Q: 10
Which of the following approaches can be used to view the notebook that was run to create an
MLflow run?
Options
Q: 11
A machine learning engineer would like to develop a linear regression model with Spark ML to
predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.
The Spark DataFrame train_df has the following schema:
The machine learning engineer shares the following code block:
Which of the following changes does the machine learning engineer need to make to complete the
task?
Options
Q: 12
Which of the following describes the relationship between native Spark DataFrames and pandas API
on Spark DataFrames?
Options
Q: 13
A data scientist has created two linear regression models. The first model uses price as a label
variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each
model by comparing the label predictions to the actual price values, the data scientist notices that
the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?
Options
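One possible cause of such a gap is a scale mismatch: the second model predicts log(price), while the comparison is made against price. The sketch below illustrates this with hypothetical prices and a hypothetical model whose log-scale predictions are perfect. It does not say which of the listed explanations is the invalid one.

```python
import math


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(actuals))


# Hypothetical prices and a hypothetical model that predicts log(price) perfectly.
actual_prices = [100.0, 200.0, 400.0]
log_predictions = [math.log(p) for p in actual_prices]

# Comparing log-scale predictions directly with prices mixes two scales.
rmse_mixed_scales = rmse(log_predictions, actual_prices)

# Exponentiating first puts the predictions back on the price scale.
rmse_price_scale = rmse([math.exp(p) for p in log_predictions], actual_prices)
```

Even with perfect log-scale predictions, `rmse_mixed_scales` is large, while `rmse_price_scale` is essentially zero.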
Q: 14
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:
Upon code review, a colleague expressed concern with the features being standardized prior to
splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?
Options
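The colleague's concern is about information from the test set leaking into the scaling statistics. A minimal sketch of the usual remedy, in plain Python rather than Spark ML, is to split first, compute the mean and standard deviation on the training portion only, and reuse those values for the test portion. The data and the helper names `fit_scaler` and `transform` are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Return the mean and sample standard deviation of the given values."""
    return statistics.mean(values), statistics.stdev(values)


def transform(values, mean, std):
    """Standardize values using statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values, split before any statistics are computed.
train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 20.0]

# The scaler only ever sees the training data.
mean, std = fit_scaler(train)

train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuses the training statistics
```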
Q: 15
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs.
When creating the table, they specified a metadata description with key information about the
Feature Table. They now want to retrieve that metadata programmatically.
Which of the following lines of code will return the metadata description?
Options