Free Practice Test

Free Machine Learning Associate Practice Exam – 2025 Updated

Study Smarter for the Machine Learning Associate Exam with Our Free and Accurate Machine Learning Associate Exam Questions – Updated for 2025.

At Cert Empire, we are committed to providing the most reliable and up-to-date exam questions for students preparing for the Databricks Machine Learning Associate Exam. To help learners study more effectively, we've made sections of our Machine Learning Associate exam resources free for everyone. You can practice as much as you want with the free Machine Learning Associate practice test.

Question 1

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline's preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day. Which approach should the data scientist take to complete this task?
Options
A: They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.
B: They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.
C: They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.
D: They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.
Show Answer
Correct Answer:
They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.
Explanation
The project is managed within a Databricks Repo, which is integrated with a Git provider. The standard and best practice for developing new features or making changes without disrupting the main production code is to use a new Git branch. This isolates the development work from the main branch, which the scheduled Job is presumably running from. The data scientist can create a branch, commit their changes, and push them to the remote Git provider. Once the work is complete and tested, it can be merged into the main branch through a pull request, ensuring a controlled and reviewable process for adoption into the daily job.
Why Incorrect Options are Wrong

B. Cloning notebooks to a Workspace folder removes them from Git version control, making integration of changes difficult and breaking the established project workflow.

C. Creating an entirely new Git repository disconnects the work from the original project's history and makes merging changes back a manual, complex process.

D. Cloning into a new Databricks Repo creates a separate project fork, which is an unnecessarily complex approach compared to using a branch for feature development.

References

1. Databricks Official Documentation, "Git operations with Databricks Repos": This document outlines the standard Git workflow within Databricks. The section "Create a new branch" explicitly describes the procedure for this task: "You can create a new branch based on an existing branch from within a repo... This is the best practice for developing your new work." This directly supports the methodology in option A as the recommended approach for isolated development.

Source: Databricks Documentation, docs/en/repos/git-operations-with-repos.html, Section: "Create a new branch".

2. Databricks Official Documentation, "CI/CD techniques with Git and Databricks Repos": This guide on best practices for development workflows emphasizes using separate Git branches for development work ("feature branches") to isolate changes from the main production branch. It states, "A common workflow is to create a new feature branch for your work... You can make your changes, and then commit and push them to the Git provider." This aligns perfectly with the scenario of improving a feature without affecting the daily production job.

Source: Databricks Documentation, docs/en/repos/ci-cd-repos.html, Section: "Development workflow".

Question 2

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run. Which of the following approaches can the team use to identify which task is the cause of the failure?
Options
A: Run each notebook interactively
B: Review the matrix view in the Job's runs
C: Migrate the Job to a Delta Live Tables pipeline
D: Change each Task's setting to use a dedicated cluster
Show Answer
Correct Answer:
Review the matrix view in the Job's runs
Explanation
The Databricks Jobs UI provides a detailed history for each job run. The matrix view (also known as the Gantt view) within a specific job run visually displays the status of each task in the job. This view clearly indicates which tasks succeeded, which are running, and, most importantly, which one failed, along with its duration and start/end times. This is the primary and most efficient method for pinpointing the exact point of failure in a multi-task job without needing to re-run code or change the job's configuration.
Why Incorrect Options are Wrong

A. Running notebooks interactively is a manual debugging step performed after identifying the failed task, not the method to identify it.

C. Migrating to Delta Live Tables is a major architectural change and is not a tool for diagnosing a standard Databricks Job failure.

D. Using dedicated clusters is a configuration change to prevent future failures, not a method to diagnose a past failed run.

References

1. Databricks Official Documentation, "View job runs": This document describes how to monitor job runs. It states, "To view the run history for a job, click the job name in the Jobs list... You can view the matrix or Gantt chart of job runs and a list of job runs." The matrix view visually distinguishes between successful and failed tasks. (See section: "View the runs for a job").

2. Databricks Official Documentation, "Troubleshoot and fix job failures": This guide explicitly recommends using the job run history to diagnose failures. It states, "If a job fails, you can investigate the cause of the failure by viewing the jobโ€™s run history... The job run details page shows which tasks ran successfully and which failed." (See section: "View job run details to determine the cause of a job failure").

Question 3

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML. Which of the following compute tools is best suited for this use case?
Options
A: Single Node cluster
B: Standard cluster
C: SQL Warehouse
D: None of these compute tools support this task
Show Answer
Correct Answer:
Standard cluster
Explanation
The described workload involves both data manipulation using Spark SQL and model training using Spark ML. This combination is a typical data science task that requires a general-purpose, interactive environment capable of running both SQL and arbitrary code (like Python for Spark ML). A Standard cluster (also known as an all-purpose cluster) is specifically designed for these interactive data science and data engineering workloads. It allows data scientists to attach notebooks and execute a mix of commands, including Spark SQL for data preparation and Spark ML for distributed model training, making it the best-suited compute tool for this scenario.
Why Incorrect Options are Wrong

A. Single Node cluster: This cluster has no worker nodes and is not designed for distributed computing, which is the primary advantage of using Spark ML for scalable machine learning tasks.

C. SQL Warehouse: This compute resource is optimized specifically for running SQL queries for business intelligence (BI) and analytics. It cannot be used to execute general-purpose code or machine learning libraries like Spark ML.

D. None of these compute tools support this task: This is incorrect because a Standard cluster is the designated and appropriate compute resource for this exact combination of tasks on the Databricks platform.

---

References

1. Databricks Official Documentation, "What is Databricks compute?": This document distinguishes between different compute types. It states, "For data science and data engineering workloads, you can use either all-purpose compute or job compute." It further clarifies that SQL warehouses are for "running SQL queries with Databricks SQL." This directly supports using a Standard (all-purpose) cluster for a data science workload involving Spark ML and explicitly excludes SQL Warehouses.

Source: Databricks Documentation > Compute > Get started > What is Databricks compute?

2. Databricks Official Documentation, "Cluster modes": This page details the different cluster modes. It describes the "Standard" mode as the recommended option for single users, capable of running workloads in various languages (Python, R, Scala, SQL). It contrasts this with the "Single Node" mode, which has no workers and is not suitable for large-scale Spark jobs.

Source: Databricks Documentation > Compute > Configuration > Cluster modes.

3. Databricks Official Documentation, "What is a SQL warehouse?": This source defines the purpose of a SQL Warehouse. It states, "A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL." This confirms its specialization for SQL and BI tools, not for programmatic machine learning tasks found in notebooks.

Source: Databricks Documentation > SQL > Get started > What is a SQL warehouse?

Question 4

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df. batch_df has the following schema: customer_id STRING. The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri: [code block image not shown]. In which situation will the machine learning engineer's code block perform the desired inference?
Options
A: When the Feature Store feature set was logged with the model at model_uri
B: When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark
C: When the model at model_uri only uses customer_id as a feature
D: This code block will not perform the desired inference in any situation.
E: When all of the features used by the model at model_uri are in a single Feature Store table
Show Answer
Correct Answer:
When the Feature Store feature set was logged with the model at model_uri
Explanation
The code attempts to generate predictions on a DataFrame (batch_df) that contains only a primary key (customer_id). For a typical machine learning model that requires multiple feature columns, this would fail. However, the Databricks Feature Store provides a specific mechanism to handle this scenario. By using FeatureStoreClient.log_model(), a model is packaged with metadata about the features it was trained on. When this "feature-aware" model is used for inference via mlflow.pyfunc.spark_udf, it automatically uses the provided keys (customer_id) to look up the corresponding feature values from the Feature Store, joins them, and then computes the prediction. This is the only scenario among the options where the code will execute as intended.
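
A minimal sketch of the key-only batch-scoring pattern described above, using FeatureStoreClient.score_batch (the documented API for this lookup behavior, standing in for the spark_udf call in the question's code block); model_uri and batch_df are assumed to exist as in the question:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Training side (done earlier): log_model packages the model with the
# FeatureLookup metadata from its training set, making it "feature-aware".
# fs.log_model(model, artifact_path="model", flavor=mlflow.sklearn, training_set=training_set)

# Inference side: batch_df only needs the lookup key (customer_id); score_batch
# looks up and joins the stored feature values before computing predictions.
predictions_df = fs.score_batch(model_uri, batch_df)
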
Why Incorrect Options are Wrong

B. The code as written does not reference or perform any joins with other Spark DataFrames, so their mere existence in the environment is irrelevant.

C. A linear regression model requires numerical features for computation. A string identifier like customer_id is unsuitable as the sole feature and would likely cause a type error.

D. This is incorrect because the integration between MLflow and the Databricks Feature Store (as described in option A) provides a valid and common pattern for this code to work.

E. While the features must exist in the Feature Store, the critical condition is that the model was logged with the feature lookup metadata, making it "feature-aware".

References

1. Databricks Official Documentation, "Train models and perform batch inference with Feature Store":

Section: "Perform batch inference"

Content: This section explicitly states: "To score a model on new data, use the FeatureStoreClient.score_batch method... This method looks up the features for the data in df from the feature tables specified in the feature_lookups used when the model was logged...". The mlflow.pyfunc.spark_udf leverages this underlying capability for models logged with feature metadata. This directly supports that logging the feature set with the model is the required condition.

2. Databricks Official Documentation, "Feature Store Python API reference":

Section: databricks.feature_store.client.FeatureStoreClient.log_model

Content: The documentation for this function explains that it "Packages the model with feature metadata." This metadata is precisely what enables the automatic feature lookup during inference, which is the core principle making the code in the question work.

3. Databricks Official Documentation, "What is a feature store?":

Section: "Model training and inference with Feature Store"

Content: The documentation illustrates the MLOps lifecycle, showing that for inference, the model can be provided with only primary keys. The model then "retrieves the precomputed features from the feature store" before making a prediction. This confirms the pattern described in the correct answer.

Question 5

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?
Options
A: F1
B: R-squared
C: MAE
D: MSE
Show Answer
Correct Answer:
F1
Explanation
The F1 score is an evaluation metric used for classification problems, not regression. It is calculated as the harmonic mean of precision and recall, which are metrics based on the counts of true positives, false positives, and false negatives from a confusion matrix. These concepts are fundamentally tied to predicting discrete class labels. Regression problems, in contrast, predict continuous numerical values. Therefore, metrics like R-squared, Mean Absolute Error (MAE), and Mean Squared Error (MSE) are appropriate as they measure the magnitude of the error between the predicted and actual continuous values.
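
As a hedged illustration of how this surfaces in practice, a Databricks AutoML regression run accepts only regression metrics as its primary metric; the dataset and column names below are placeholders:

from databricks import automl

# primary_metric must be a regression metric such as "rmse", "mse", "mae", or "r2";
# classification metrics like F1 are not valid for regression problems.
summary = automl.regress(
    dataset=features_df,   # placeholder training DataFrame
    target_col="price",    # placeholder continuous target column
    primary_metric="rmse",
    timeout_minutes=30,
)
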
Why Incorrect Options are Wrong

B. R-squared: This is a standard regression metric measuring the proportion of variance in the target variable that is predictable from the features.

C. MAE: Mean Absolute Error is a common regression metric that measures the average magnitude of the errors in a set of predictions.

D. MSE: Mean Squared Error is a widely used regression metric that measures the average of the squared differences between predicted and actual values.

References

1. Databricks Official Documentation, "Regression and forecasting: model metrics": This document explicitly lists the metrics calculated for each run in a Databricks AutoML regression experiment. The primary metric is root mean squared error (RMSE), and other generated metrics include R-squared, MAE, and MSE. The F1 score is not mentioned for regression.

Source: Databricks Machine Learning Guide > AutoML > Reference > Regression and forecasting: model metrics.

2. Databricks Official Documentation, "Classification: model metrics": This document lists the metrics for AutoML classification experiments, which include F1 score, accuracy, log loss, precision, and recall. This confirms that F1 is a classification metric within the Databricks ecosystem.

Source: Databricks Machine Learning Guide > AutoML > Reference > Classification: model metrics.

3. Stanford University, CS229 Machine Learning Course Notes: In the course materials covering evaluation metrics, a clear distinction is made. Metrics for classification include accuracy, precision, recall, and F1 score. Metrics for regression include Mean Squared Error (MSE) and Mean Absolute Error (MAE).

Source: Ng, A. (2023). CS229 Machine Learning Course Notes, Stanford University, "Part V: Learning Theory" and "Part VI: Evaluation Metrics".

Question 6

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column's median value. They have developed the following code block to accomplish this task: [code block image not shown]. The code block is not accomplishing the task. Which reason describes why the code block is not accomplishing the imputation task?
Options
A: It does not impute both the training and test data sets.
B: The inputCols and outputCols need to be exactly the same.
C: The fit method needs to be called instead of transform.
D: It does not fit the imputer on the data to create an ImputerModel.
Show Answer
Correct Answer:
It does not fit the imputer on the data to create an ImputerModel.
Explanation
The code fails because it does not follow the standard Spark ML Estimator and Transformer design pattern. The Imputer class is an Estimator, which learns from data. It must first be called with the .fit() method on the DataFrame (features_df). This process calculates the median for each specified column and returns a fitted ImputerModel object. This ImputerModel is a Transformer, which can then be used to apply the learned transformation (i.e., fill missing values) to a DataFrame using the .transform() method. The provided code incorrectly attempts to call .transform() directly on the unfitted Imputer estimator, skipping the crucial fit step.
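
A minimal sketch of the corrected estimator/transformer sequence; the numeric column names are illustrative since the question does not list them:

from pyspark.ml.feature import Imputer

imputer = Imputer(
    strategy="median",
    inputCols=["units", "spend"],   # illustrative numeric columns
    outputCols=["units", "spend"],  # overwrite the originals in place
)

# fit() learns each column's median and returns an ImputerModel (a Transformer);
# transform() on that fitted model fills in the missing values.
imputer_model = imputer.fit(features_df)
imputed_df = imputer_model.transform(features_df)
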
Why Incorrect Options are Wrong

A. The question does not mention a test set. The code fails to impute any data, making the train/test distinction irrelevant to the core error.

B. Setting inputCols and outputCols to the same list of columns is a valid pattern in Spark ML used to overwrite the original columns with the imputed values.

C. The .fit() method does not replace the .transform() method. The correct workflow requires calling .fit() first to create a model, and then calling .transform() on that model.

References

1. Apache Spark Official Documentation, MLlib Guide, Feature Extractors, Transformers, and Selectors: "An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. It is implemented as a class that has a fit() method. The fit() method accepts a DataFrame and returns a Model, which is a Transformer."

Source: Apache Spark 3.5.0 Documentation, MLlib > ML Pipelines > Main Concepts in Pipelines.

2. Databricks Official Documentation, Impute missing values: The official example code demonstrates the correct two-step process. First, an Imputer is instantiated. Second, the .fit() method is called on the DataFrame to create a model. Third, the .transform() method is called on the resulting model.

Source: Databricks Documentation > Machine Learning > Feature engineering > Impute missing values. The example code clearly shows model = imputer.fit(df) followed by model.transform(df).show().

3. Apache Spark Official Documentation, pyspark.ml.feature.Imputer API: The API documentation specifies that Imputer is an Estimator with a fit() method that returns an ImputerModel. The ImputerModel is a Transformer that possesses the transform() method. This confirms the required sequence of operations.

Source: Apache Spark 3.5.0 Documentation, pyspark.ml.feature Module > Imputer class.

Question 7

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable. They have developed this code block to accomplish this task: [code block image not shown]. The code block is returning an error. Which of the following adjustments does the data scientist need to make to accomplish this task?
Options
A: They need to specify the method parameter to the OneHotEncoder.
B: They need to remove the line with the fit operation.
C: They need to use StringIndexer prior to one-hot encoding the features.
D: They need to use VectorAssembler prior to one-hot encoding the features.
Show Answer
Correct Answer:
They need to use StringIndexer prior to one-hot encoding the features.
Explanation
The pyspark.ml.feature.OneHotEncoder in Spark ML is designed to transform columns of category indices into columns of binary vectors. It cannot operate directly on columns containing string values. The correct and standard procedure is to first apply a StringIndexer to the categorical string columns. The StringIndexer converts the string labels into numerical indices. These resulting index columns can then be fed into the OneHotEncoder to produce the desired one-hot encoded vectors. The original code block omits this mandatory indexing step, causing the error.
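
A minimal sketch of the corrected two-step approach, assuming features_df and input_columns from the question; the "_idx" and "_ohe" suffixes are illustrative naming choices:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Convert the string columns to category indices first...
indexer = StringIndexer(
    inputCols=input_columns,
    outputCols=[c + "_idx" for c in input_columns],
    handleInvalid="keep",
)

# ...then one-hot encode the resulting numeric index columns.
encoder = OneHotEncoder(
    inputCols=indexer.getOutputCols(),
    outputCols=[c + "_ohe" for c in input_columns],
)

encoded_df = Pipeline(stages=[indexer, encoder]).fit(features_df).transform(features_df)
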
Why Incorrect Options are Wrong

A. The method parameter for OneHotEncoder is deprecated since Spark 3.0 and is not the source of the error, which is related to the input data type.

B. The .fit() method is essential for an Estimator like OneHotEncoder. It learns the number of categories from the data to correctly build the model for transformation.

D. VectorAssembler is used to combine multiple feature columns into a single vector column, a step that is typically performed after one-hot encoding, not before.

References

1. Databricks Official Documentation, "One-hot encoder": "One-hot encoder maps a column of category indices to a column of binary vectors... The input columns must be of numeric type." This statement confirms that the input cannot be a string and must be a category index.

Source: Databricks Documentation > Docs > Machine Learning > Feature engineering > Feature transformers > One-hot encoder.

2. Apache Spark 3.4.1 MLlib Guide, "Feature Transformers - OneHotEncoder": "It is common to have StringIndexer compute the label indices and then OneHotEncoder to encode the indexed labels." This explicitly describes the required two-step process.

Source: Apache Spark Documentation > Spark 3.4.1 > MLlib > MLlib Programming Guide > Feature Extraction, Transformation and Selection > OneHotEncoder.

3. Databricks Official Documentation, "StringIndexer": The example code on this page demonstrates the standard pipeline where a StringIndexer is applied first, followed by a OneHotEncoder, confirming the correct sequence of operations.

Source: Databricks Documentation > Docs > Machine Learning > Feature engineering > Feature transformers > StringIndexer.

Question 8

A data scientist is wanting to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration. Which of the following lines of code can the data scientist run to accomplish the task?
Options
A: spark_df.describe()
B: dbutils.data(spark_df).summarize()
C: This task cannot be accomplished in a single line of code.
D: spark_df.summary()
E: dbutils.data.summarize (spark_df)
Show Answer
Correct Answer:
dbutils.data.summarize (spark_df)
Explanation
The dbutils.data.summarize(spark_df) command is a specific Databricks utility designed for exploratory data analysis. When executed within a Databricks notebook, it generates an interactive visualization of the DataFrame's summary statistics. This output includes histograms for numeric columns, frequency plots for categorical columns, and counts of null values, directly fulfilling the data scientist's requirement for visual histograms in a single line of code. Standard Apache Spark functions like .describe() and .summary() only return a DataFrame containing statistical text values, not visual plots.
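
As a brief illustration (run inside a Databricks notebook, where dbutils is available):

# Generates the interactive summary, including histograms for numeric columns.
dbutils.data.summarize(spark_df)

# By contrast, these return DataFrames of text statistics and produce no plots:
spark_df.describe().show()
spark_df.summary().show()
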
Why Incorrect Options are Wrong

A. spark_df.describe(): This function returns a DataFrame with summary statistics (count, mean, stddev, min, max) but does not generate any visual plots or histograms.

B. dbutils.data(spark_df).summarize(): This is syntactically incorrect. The DataFrame should be passed as an argument to the summarize function, not chained in this manner.

C. This task cannot be accomplished in a single line of code: This is incorrect because dbutils.data.summarize(spark_df) accomplishes the task in one line.

D. spark_df.summary(): Similar to .describe(), this function returns a DataFrame with summary statistics but does not produce visual histograms.

References

1. Databricks Official Documentation, "Data utility (dbutils.data)": This document explicitly describes the summarize command. It states, "The summarize command performs a summary of the columns of a Spark DataFrame and returns the result in a structured and visual way." The page includes an example output showing the generated histograms.

Reference: Databricks Documentation > Reference > Databricks utilities (dbutils) > Data utility (dbutils.data).

2. Apache Spark Official Documentation, pyspark.sql.DataFrame.describe: This API documentation confirms that the describe() method "Computes basic statistics for numeric and string columns" and returns a new DataFrame with these statistics, with no mention of visualization capabilities.

Reference: Apache Spark 3.5.0 > PySpark API Reference > pyspark.sql > pyspark.sql.DataFrame.describe.

3. Apache Spark Official Documentation, pyspark.sql.DataFrame.summary: This API documentation details that the summary() method computes specified statistics for columns and returns them as a DataFrame. It is a more advanced version of describe() but still does not generate visual output.

Reference: Apache Spark 3.5.0 > PySpark API Reference > pyspark.sql > pyspark.sql.DataFrame.summary.

Question 9

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process. Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?
Options
A: fmin
B: SparkTrials
C: quniform
D: search_space
E: objective_function
Show Answer
Correct Answer:
SparkTrials
Explanation
Hyperopt's SparkTrials class is specifically designed to enable distributed, parallel hyperparameter tuning on a Databricks cluster. When an instance of SparkTrials is passed to the fmin function, Hyperopt leverages the underlying Spark framework to execute multiple tuning trials concurrently across the worker nodes. The degree of parallelism can be explicitly controlled by the parallelism parameter when initializing SparkTrials. This allows data scientists to significantly accelerate the search for optimal hyperparameters by evaluating multiple configurations at the same time, which is not possible with the default, serial Trials class.
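
A minimal sketch of the pattern; the search space and objective below are placeholders, and a real objective would train and score the scikit-learn model:

from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

search_space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

def objective(params):
    # Placeholder loss; a real objective returns, e.g., validation error.
    return {"loss": (params["max_depth"] - 6) ** 2, "status": STATUS_OK}

# SparkTrials distributes trials to Spark workers; parallelism controls
# how many trials run concurrently.
spark_trials = SparkTrials(parallelism=4)
best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=16,
    trials=spark_trials,
)
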
Why Incorrect Options are Wrong

A. fmin: This is the main optimization function that runs the tuning loop, but it requires a SparkTrials object to actually execute the trials in parallel.

C. quniform: This is a function used within the search space to define a quantized uniform distribution for a hyperparameter; it does not control execution.

D. search_space: This is a dictionary that defines the hyperparameters and their distributions to be tested; it is not a tool for parallelization.

E. objective_function: This user-defined function evaluates a single set of hyperparameters; it does not manage the parallel execution of multiple evaluations.

References

1. Databricks Official Documentation, "Parallelize hyperparameter tuning with scikit-learn and MLflow": This document explicitly states, "To parallelize tuning, use the SparkTrials class. SparkTrials takes one argument, parallelism... Hyperopt evaluates this many trials in parallel." It provides a clear code example showing fmin being used with a SparkTrials object. (Databricks Machine Learning Guide > Hyperparameter tuning > Scikit-learn, MLflow, and automated MLflow tracking > Parallelize hyperparameter tuning with scikit-learn and MLflow).

2. Databricks Official Documentation, "Hyperopt concepts": In the section describing the components of a Hyperopt workflow, it defines SparkTrials as the class to "Use to scale up tuning. SparkTrials distributes trials to Spark workers." (Databricks Machine Learning Guide > Hyperparameter tuning > Hyperopt concepts).

Question 10

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem: ● Hyperparameter 1: [2, 5, 10] ● Hyperparameter 2: [50, 100] Which of the following represents the number of machine learning models that can be trained in parallel during this process?
Options
A: 3
B: 5
C: 6
D: 18
Show Answer
Correct Answer:
18
Explanation
The total number of models to be trained is determined by multiplying the number of hyperparameter combinations by the number of cross-validation folds. First, calculate the number of hyperparameter combinations from the grid search: (Values for Hyperparameter 1) × (Values for Hyperparameter 2) = 3 × 2 = 6 combinations. Next, for each of these 6 combinations, 3-fold cross-validation is performed. This requires training a separate model for each fold. Total models = 6 combinations × 3 folds = 18 models. Since each of these 18 model training runs is an independent task, a distributed computing platform like Databricks can execute all of them in parallel, assuming sufficient cluster resources are available.
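
A hedged sketch of the same arithmetic in Spark ML terms; the estimator and hyperparameter names are illustrative stand-ins for the unnamed classifier in the question:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()

# 3 values x 2 values = 6 hyperparameter combinations.
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.maxIter, [2, 5, 10])
    .addGrid(lr.regParam, [50.0, 100.0])
    .build()
)

# 6 combinations x 3 folds = 18 independent model fits in total; the
# parallelism parameter controls how many fits are evaluated concurrently.
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,
    parallelism=6,
)
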
Why Incorrect Options are Wrong

A. 3: This represents only the number of cross-validation folds, not the total number of models trained across all hyperparameter combinations.

B. 5: This is the sum of the number of hyperparameter values (3 + 2), which is an incorrect calculation for a grid search.

C. 6: This correctly identifies the number of hyperparameter combinations (3 × 2) but omits the 3 models trained for each combination due to 3-fold cross-validation.

---

References

1. Apache Spark Official Documentation, pyspark.ml.tuning.CrossValidator: The documentation for Spark's CrossValidator describes the process: "For each paramMap, CrossValidator will split the dataset into k folds. Then it will train on k-1 folds and evaluate on the remaining fold." This confirms that for each hyperparameter combination (a paramMap), k models are trained (one for each fold). The parallelism parameter further confirms that these model fits can be executed in parallel. In this scenario, there are 6 paramMaps and k=3, resulting in 18 total model fits that can be parallelized.

Source: Apache Spark 3.5.0 Documentation, MLlib: Main Guide > ML Tuning: model selection and hyperparameter tuning > Cross-Validation.

2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. In Chapter 7, Section 7.10.1 "Cross-Validation," the authors describe K-fold cross-validation. The process involves fitting the model K times on different subsets of the training data. When combined with a grid search, this fitting process is repeated for every point in the hyperparameter grid. The independence of each model fit makes the overall process highly parallelizable.

Source: Chapter 7, "Model Assessment and Selection," Section 7.10.1, page 242.

3. Databricks Machine Learning Documentation, "Hyperparameter tuning": The documentation explains how tools like Hyperopt with SparkTrials can "distribute runs and manage models" for hyperparameter tuning. This distribution of runs across a cluster's worker nodes is the mechanism that enables the parallel training of the multiple models generated by a grid search and cross-validation process.

Source: Databricks Documentation > Machine Learning > Models > Hyperparameter tuning.

Question 11

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion?
Options
A: One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
B: One-hot encoding is dependent on the target variable's values, which differ for each application.
C: One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
D: One-hot encoding is not a common strategy for representing categorical feature variables numerically.
Show Answer
Correct Answer:
One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
Explanation
A central feature repository, like the Databricks Feature Store, is designed to provide features that can be reused across multiple projects and machine learning models. Different model types have different requirements for feature representation. For example, tree-based algorithms (e.g., Random Forest, XGBoost) can often handle categorical features directly or with simple integer encoding, and one-hot encoding can create unnecessarily high-dimensional and sparse feature spaces that are detrimental to their performance. In contrast, linear models and neural networks typically require one-hot encoding. By not one-hot encoding features within the repository itself, data scientists retain the flexibility to apply the most appropriate encoding strategy for their specific algorithm during the model training phase, making the stored features more universally applicable.
Why Incorrect Options are Wrong

B. One-hot encoding is an unsupervised transformation that depends only on the unique values within the feature itself, not on the target variable.

C. While creating many new columns can increase computational load, one-hot encoding is a standard preprocessing step applied to entire datasets, not just small samples.

D. One-hot encoding is one of the most common and standard strategies for numerically representing categorical features, particularly for linear models.

References

1. Databricks Official Documentation, "Feature Engineering in Unity Catalog": The documentation advocates for a "transform-on-read" approach. It states, "The logic that is used to compute features is managed in Unity Catalog and applied to input data before a model is trained or inference is performed." This supports the principle of storing features in a more raw state and applying model-specific transformations like one-hot encoding later in the ML pipeline. (See the section on "Feature engineering patterns on Databricks").

2. Stanford University, CS229 Machine Learning Course Notes: Lecture notes on "Advice for Applying Machine Learning" and "Supervised Learning" frequently discuss feature engineering. They explain that while linear models require numerical inputs (making one-hot encoding necessary for categorical data), decision trees can intrinsically handle discrete, unordered features. This highlights that the choice of encoding is dependent on the algorithm, justifying why a universal one-hot encoding in a feature store is problematic. (See discussions on feature selection and representation for different model classes).

3. Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. In Chapter 4, "Feature Engineering," the author discusses various categorical feature encoding techniques. The text explains the trade-offs, noting that one-hot encoding can lead to the "curse of dimensionality," which is problematic for certain algorithms. This reinforces the idea that encoding is a model-dependent choice, not a universal preprocessing step suitable for a central feature repository. (See Section: "Categorical Features > One-Hot Encoding").

Question 12

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df. They are using the following code block to evaluate the model: regression_evaluator.setMetricName("rmse").evaluate(preds_df) Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?
Options
A: They should exponentiate the computed RMSE value
B: They should take the log of the predictions before computing the RMSE
C: They should evaluate the MSE of the log predictions to compute the RMSE
D: They should exponentiate the predictions before computing the RMSE
Show Answer
Correct Answer:
They should exponentiate the predictions before computing the RMSE
Explanation
The model was trained to predict the logarithm of the price, log(price). Consequently, its predictions are on a logarithmic scale. The Root Mean Squared Error (RMSE) is calculated based on the difference between true and predicted values. To obtain an RMSE that is interpretable in the original currency units (e.g., dollars), the error must be calculated on the original price scale. This requires applying the inverse of the logarithmic transformation, which is exponentiation, to both the predicted values and the true label values before they are passed to the evaluator. This converts both columns back to the price scale, allowing the RMSE to be computed in a directly comparable and meaningful way.
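
A minimal sketch of the back-transformation, assuming preds_df uses the Spark ML default column names "prediction" and "label":

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import exp

# Convert both columns from log(price) back to price before evaluating.
preds_price_scale = (
    preds_df
    .withColumn("prediction", exp("prediction"))
    .withColumn("label", exp("label"))
)

rmse = RegressionEvaluator(metricName="rmse").evaluate(preds_price_scale)
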
Why Incorrect Options are Wrong

A. Exponentiating the final RMSE value is a mathematically incorrect operation that does not properly transform the error back to the original scale.

B. The model's predictions are already on the log scale, so taking the logarithm again would be an erroneous double transformation.

C. Evaluating the Mean Squared Error (MSE) instead of RMSE does not address the fundamental problem of the evaluation being performed on the incorrect (logarithmic) scale.

References

1. Databricks Documentation, PySpark API Reference for pyspark.ml.evaluation.RegressionEvaluator: The documentation specifies that the evaluator computes metrics based on an input DataFrame with predictionCol and labelCol. It operates directly on the data provided. This implies that if the data in these columns is on a transformed scale (like a log scale), the user is responsible for applying the appropriate inverse transformation to the columns before evaluation to get a metric on the original scale.

Source: Apache Spark 3.5.0 documentation (as used by Databricks), pyspark.ml.evaluation.RegressionEvaluator.

2. University Courseware on Statistical Modeling: In statistical modeling, when the response variable Y is transformed (e.g., log(Y)), the model produces predictions on the transformed scale. To obtain a prediction for Y in its original units, one must back-transform the prediction. For a log transformation, the back-transformation is exponentiation. This principle is fundamental for both prediction and the evaluation of prediction error.

Source: PennState, STAT 501: Regression Methods, Lesson 9: Transformations. The course notes explicitly state: "To transform a predicted value for log Y back to a predicted value for Y, we need to take the antilog of the predicted value." This directly supports applying an exponential function before calculating error metrics.

Question 13

A data scientist is working with a feature set with the following schema: [schema image not shown]. The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature. Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?
Options
A: customer_id, loyalty_tier
B: loyalty_tier
C: units
D: spend
E: customer_id
Show Answer
Correct Answer:
loyalty_tier
Explanation
The choice of imputation strategy depends on the data type of the feature. The strategy of imputing with the most common value (the mode) is standard practice for categorical features. In the provided schema, loyalty_tier is a string type representing distinct categories, making it the only appropriate candidate for mode imputation. Numerical columns like units (integer) and spend (double) are typically imputed using the mean or median. The customer_id column is a primary key; imputing missing values in an identifier column is incorrect as it would violate its uniqueness and integrity, leading to invalid data.
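
A minimal sketch of mode imputation for the categorical column, assuming the feature set is loaded as a Spark DataFrame named features_df:

from pyspark.sql.functions import count, desc

# Most common non-null loyalty_tier value (the mode).
mode_value = (
    features_df
    .where("loyalty_tier IS NOT NULL")
    .groupBy("loyalty_tier")
    .agg(count("*").alias("n"))
    .orderBy(desc("n"))
    .first()["loyalty_tier"]
)

# Fill missing loyalty_tier values with the mode; numeric columns such as
# units and spend would instead be imputed with their median or mean.
imputed_df = features_df.fillna({"loyalty_tier": mode_value})
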
Why Incorrect Options are Wrong

A. customer_id is a primary key and should not be imputed, as this would compromise data integrity.

C. units is a numerical column. While mode imputation is possible, mean or median imputation is generally preferred for numerical data.

D. spend is a continuous numerical column, for which mean or median imputation is the standard and more appropriate method.

E. customer_id is a unique identifier. Imputing it would create duplicate or meaningless keys.

References

1. Databricks Documentation, "Handle missing data with scikit-learn and pandas": This guide explicitly states the standard practice for different data types. In the section on imputation, it recommends: "For categorical features, you can impute missing values with the most frequent value (mode)." This directly supports using the mode for the loyaltytier column.

Source: Databricks Official Documentation > Machine Learning > Data preparation > Handle missing data with scikit-learn and pandas.

2. Apache Spark Documentation, "Feature transformers - Imputer": The Imputer transformer, available in PySpark's ML library, is the tool used for this task on Databricks. It supports mean, median, and mode strategies. The documentation and common usage patterns confirm that mode is the designated strategy for categorical data, whereas mean and median are for numerical data.

Source: Apache Spark 3.5.0 Documentation > MLlib > pyspark.ml.feature.Imputer.

3. University of California, Berkeley, Courseware: In the "Data 100: Principles and Techniques of Data Science" course, the lecture on Data Cleaning outlines imputation methods. It specifies that for "Qualitative/Categorical" variables, a common strategy is to "Replace with the mode (most frequent category)." For "Quantitative" variables, it recommends replacing with the mean or median.

Source: UC Berkeley, Data 100, Spring 2024, Lecture 7: Data Cleaning, Slide 53 "Imputation Strategies".

Question 14

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal 0. Which of the following code blocks will accomplish this task?
Options
A: spark_df.loc[:,spark_df["discount"] <= 0]
B: spark_df[spark_df["discount"] <= 0]
C: spark_df.filter(col("discount") <= 0)
D: spark_df.loc(spark_df["discount"] <= 0, :]
Show Answer
Correct Answer:
spark_df.filter(col("discount") <= 0)
Explanation
The correct method to filter rows in a PySpark DataFrame is by using the .filter() transformation. This method takes a boolean Column expression as its argument to determine which rows to keep. The code col("discount") <= 0 generates this required boolean expression, selecting all rows where the value in the "discount" column is less than or equal to zero. This is the standard, idiomatic, and performant way to filter data within the Spark ecosystem.
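
For completeness, the correct call with its import; equivalent PySpark forms are shown as comments:

from pyspark.sql.functions import col

# Keep only rows where discount is less than or equal to 0.
non_positive_df = spark_df.filter(col("discount") <= 0)

# Equivalent forms:
# spark_df.filter(spark_df["discount"] <= 0)
# spark_df.where("discount <= 0")
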
Why Incorrect Options are Wrong

A. spark_df.loc[:,spark_df["discount"] <= 0]

This syntax uses the .loc indexer, which is a feature of the pandas library for label-based indexing and does not exist in the Spark DataFrame API.

B. spark_df[spark_df["discount"] <= 0]

This style of boolean mask filtering is the standard syntax for pandas DataFrames. While Spark has some similar bracket notation, this specific pattern is not the idiomatic way to filter rows.

D. spark_df.loc(spark_df["discount"] <= 0, :]

This syntax is invalid. It incorrectly attempts to use the pandas .loc indexer as a method, which is not an attribute or method of a Spark DataFrame.

---

References

1. Apache Spark Official Documentation: The pyspark.sql.DataFrame.filter documentation explicitly shows the correct usage. It states that the method "Filters rows using the given condition."

Source: Apache Spark 3.5.0 Documentation, pyspark.sql.DataFrame.filter.

Section: pyspark.sql.DataFrame.filter(condition)

Example provided: df.filter(df.age > 3).show() which is analogous to the correct answer's structure.

2. Databricks Official Documentation: The documentation on DataFrame transformations demonstrates filtering as a fundamental operation.

Source: Databricks Documentation, "Select, filter, sort, and aggregate data".

Section: "Filter rows"

The documentation provides the example df.filter(df.age > 21), confirming that .filter() with a column condition is the correct approach.

3. University Courseware (UC Berkeley): Course materials for data science programs frequently cover the distinction between pandas and Spark APIs.

Source: UC Berkeley, Data 100, "A Gentle Introduction to PySpark".

Section: "Filtering Data"

The guide illustrates filtering with trip.filter(trip['Duration'] > 1000), which is functionally identical to using the col() function and reinforces that .filter() is the correct transformation.

Question 15

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process. Which change could the data scientist make to improve their model accuracy over the course of their tuning process?
Options
A: Change the number of compute nodes to be half or less than half of the number of evaluations.
B: Change the number of compute nodes and the number of evaluations to be much larger but equal.
C: Change the iterative optimization algorithm used to facilitate the tuning process.
D: Change the number of compute nodes to be double or more than double the number of evaluations.
Show Answer
Correct Answer:
Change the number of compute nodes to be half or less than half of the number of evaluations.
Explanation
The core issue is the conflict between the iterative nature of the optimization algorithm and the fully parallelized execution. An iterative (or sequential) optimization algorithm, such as Hyperopt's Tree-structured Parzen Estimator (TPE), improves by learning from the results of past evaluations to inform the choice of hyperparameters for future evaluations. When the number of parallel evaluations (on 8 nodes) equals the total number of evaluations (8), all trials start simultaneously. The algorithm has no completed results to learn from, effectively making the process a random search rather than an intelligent, iterative one. By reducing the number of compute nodes to be less than the number of evaluations (e.g., 4 nodes for 8 evaluations), the algorithm can run an initial batch, analyze the results, and then intelligently select better hyperparameter combinations for the subsequent batch, creating the desired trend of improvement.
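
A hedged sketch of the adjusted setup with Hyperopt's SparkTrials, assuming objective and search_space are the tuning definitions already in use:

from hyperopt import SparkTrials, fmin, tpe

# With parallelism=4 and max_evals=8, the first 4 trials run concurrently and
# their results seed the TPE algorithm for the next 4, restoring the iterative
# improvement that a fully parallel run (parallelism=8) cannot provide.
spark_trials = SparkTrials(parallelism=4)
best_params = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=8,
    trials=spark_trials,
)
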
Why Incorrect Options are Wrong

B. Change the number of compute nodes and the number of evaluations to be much larger but equal.

This maintains the fully parallelized setup, which is the root cause of the problem, and would not introduce any sequential learning.

C. Change the iterative optimization algorithm used to facilitate the tuning process.

The problem is not the specific algorithm but its execution mode. Any iterative algorithm would fail to show sequential improvement under full parallelization.

D. Change the number of compute nodes to be double or more than double the number of evaluations.

This is inefficient as extra nodes would be idle, and it does not solve the problem since all eight evaluations would still run in parallel.

References

1. Databricks Official Documentation, "Parallelize hyperparameter tuning with Hyperopt": In the section "How Hyperopt works with SparkTrials", the documentation explains the parallelism parameter. It states, "When parallelism is greater than 1, TPE uses a random search strategy to select the first parallelism sets of hyperparameters, and then it uses the results of these initial sets to seed the TPE algorithm for subsequent sets of hyperparameters." This confirms that if parallelism equals the total number of evaluations, the process never moves beyond the initial random search phase, preventing iterative improvement. Reducing parallelism allows for subsequent, informed trials.

2. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems 24 (NIPS 2011): This paper introduces the TPE algorithm. It describes TPE as a sequential process where the choice of the next hyperparameter configuration is informed by observations of previous configurations. This sequential nature is fundamentally disrupted by a fully parallel execution where no "previous" observations are available for the initial (and only) batch of trials. The paper's description of the sequential model-based optimization process underpins why reducing parallelism (Option A) is necessary to leverage the algorithm's strength. (Available via NeurIPS Proceedings).
