Study Smarter for the Machine Learning Associate Exam with Our Free and Accurate Machine Learning Associate Exam Questions – Updated for 2025.
At Cert Empire, we are committed to providing the most reliable and up-to-date exam questions for students preparing for the Databricks Machine Learning Associate Exam. To help learners study more effectively, we've made sections of our Machine Learning Associate exam resources free for everyone. You can practice as much as you want with our free Machine Learning Associate practice test.
Question 1
B. Cloning notebooks to a Workspace folder removes them from Git version control, making integration of changes difficult and breaking the established project workflow.
C. Creating an entirely new Git repository disconnects the work from the original project's history and makes merging changes back a manual, complex process.
D. Cloning into a new Databricks Repo creates a separate project fork, which is an unnecessarily complex approach compared to using a branch for feature development.
1. Databricks Official Documentation, "Git operations with Databricks Repos": This document outlines the standard Git workflow within Databricks. The section "Create a new branch" explicitly describes the procedure for this task: "You can create a new branch based on an existing branch from within a repo... This is the best practice for developing your new work." This directly supports the methodology in option A as the recommended approach for isolated development.
Source: Databricks Documentation, docs/en/repos/git-operations-with-repos.html, Section: "Create a new branch".
2. Databricks Official Documentation, "CI/CD techniques with Git and Databricks Repos": This guide on best practices for development workflows emphasizes using separate Git branches for development work ("feature branches") to isolate changes from the main production branch. It states, "A common workflow is to create a new feature branch for your work... You can make your changes, and then commit and push them to the Git provider." This aligns perfectly with the scenario of improving a feature without affecting the daily production job.
Source: Databricks Documentation, docs/en/repos/ci-cd-repos.html, Section: "Development workflow".
Question 2
A. Running notebooks interactively is a manual debugging step performed after identifying the failed task, not the method to identify it.
C. Migrating to Delta Live Tables is a major architectural change and is not a tool for diagnosing a standard Databricks Job failure.
D. Using dedicated clusters is a configuration change to prevent future failures, not a method to diagnose a past failed run.
1. Databricks Official Documentation, "View job runs": This document describes how to monitor job runs. It states, "To view the run history for a job, click the job name in the Jobs list... You can view the matrix or Gantt chart of job runs and a list of job runs." The matrix view visually distinguishes between successful and failed tasks. (See section: "View the runs for a job").
2. Databricks Official Documentation, "Troubleshoot and fix job failures": This guide explicitly recommends using the job run history to diagnose failures. It states, "If a job fails, you can investigate the cause of the failure by viewing the job's run history... The job run details page shows which tasks ran successfully and which failed." (See section: "View job run details to determine the cause of a job failure").
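The same diagnosis can also be scripted rather than done in the UI. Below is a minimal sketch using the Databricks Jobs API 2.1; the host and token environment variables and the run ID are placeholders for illustration, not values from the question.

import os
import requests

# DATABRICKS_HOST and DATABRICKS_TOKEN are assumed environment variables, and the
# run_id below is a hypothetical placeholder for the failed job run's ID.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": 123456},
)
resp.raise_for_status()

# Each task in a multi-task run reports its own state, so the failed task stands out.
for task in resp.json().get("tasks", []):
    state = task.get("state", {})
    print(task["task_key"], state.get("result_state"), state.get("state_message", ""))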
Question 3
A. Single Node cluster: This cluster has no worker nodes and is not designed for distributed computing, which is the primary advantage of using Spark ML for scalable machine learning tasks.
C. SQL Warehouse: This compute resource is optimized specifically for running SQL queries for business intelligence (BI) and analytics. It cannot be used to execute general-purpose code or machine learning libraries like Spark ML.
D. None of these compute tools support this task: This is incorrect because a Standard cluster is the designated and appropriate compute resource for this exact combination of tasks on the Databricks platform.
---
1. Databricks Official Documentation, "What is Databricks compute?": This document distinguishes between different compute types. It states, "For data science and data engineering workloads, you can use either all-purpose compute or job compute." It further clarifies that SQL warehouses are for "running SQL queries with Databricks SQL." This directly supports using a Standard (all-purpose) cluster for a data science workload involving Spark ML and explicitly excludes SQL Warehouses.
Source: Databricks Documentation > Compute > Get started > What is Databricks compute?
2. Databricks Official Documentation, "Cluster modes": This page details the different cluster modes. It describes the "Standard" mode as the recommended option for single users, capable of running workloads in various languages (Python, R, Scala, SQL). It contrasts this with the "Single Node" mode, which has no workers and is not suitable for large-scale Spark jobs.
Source: Databricks Documentation > Compute > Configuration > Cluster modes.
3. Databricks Official Documentation, "What is a SQL warehouse?": This source defines the purpose of a SQL Warehouse. It states, "A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL." This confirms its specialization for SQL and BI tools, not for programmatic machine learning tasks found in notebooks.
Source: Databricks Documentation > SQL > Get started > What is a SQL warehouse?
Question 4
In which situation will the machine learning engineer's code block perform the desired inference?
B. The code as written does not reference or perform any joins with other Spark DataFrames, so their mere existence in the environment is irrelevant.
C. A linear regression model requires numerical features for computation. A string identifier like customer_id is unsuitable as the sole feature and would likely cause a type error.
D. This is incorrect because the integration between MLflow and the Databricks Feature Store (as described in option A) provides a valid and common pattern for this code to work.
E. While the features must exist in the Feature Store, the critical condition is that the model was logged with the feature lookup metadata, making it "feature-aware".
1. Databricks Official Documentation, "Train models and perform batch inference with Feature Store":
Section: "Perform batch inference"
Content: This section explicitly states: "To score a model on new data, use the FeatureStoreClient.score_batch method... This method looks up the features for the data in df from the feature tables specified in the feature_lookups used when the model was logged...". The mlflow.pyfunc.spark_udf function leverages this underlying capability for models logged with feature metadata. This directly supports that logging the feature set with the model is the required condition.
2. Databricks Official Documentation, "Feature Store Python API reference":
Section: databricks.feature_store.client.FeatureStoreClient.log_model
Content: The documentation for this function explains that it "Packages the model with feature metadata." This metadata is precisely what enables the automatic feature lookup during inference, which is the core principle making the code in the question work.
3. Databricks Official Documentation, "What is a feature store?":
Section: "Model training and inference with Feature Store"
Content: The documentation illustrates the MLOps lifecycle, showing that for inference, the model can be provided with only primary keys. The model then "retrieves the precomputed features from the feature store" before making a prediction. This confirms the pattern described in the correct answer.
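For context, here is a minimal sketch of the feature-aware logging and scoring pattern the references describe. The feature table name, column names, and the labels_df / inference_df DataFrames (Spark DataFrames keyed by customer_id) are assumptions for illustration; the documented score_batch path relies on the same packaged feature metadata that mlflow.pyfunc.spark_udf uses at inference time.

import mlflow
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.linear_model import LinearRegression

fs = FeatureStoreClient()

feature_lookups = [
    FeatureLookup(
        table_name="ml.sales.customer_features",  # hypothetical feature table
        lookup_key="customer_id",                 # joined on the primary key of the input DataFrame
    )
]

# create_training_set joins the stored features onto the labeled DataFrame.
training_set = fs.create_training_set(
    df=labels_df,                  # assumed Spark DataFrame with customer_id and the label
    feature_lookups=feature_lookups,
    label="spend",
    exclude_columns=["customer_id"],
)
training_pdf = training_set.load_df().toPandas()
model = LinearRegression().fit(training_pdf.drop(columns=["spend"]), training_pdf["spend"])

# Logging through the Feature Store client packages the feature lookup metadata with
# the model; this metadata is what enables inference from primary keys alone.
with mlflow.start_run() as run:
    fs.log_model(model, "model", flavor=mlflow.sklearn, training_set=training_set)

# At inference time only customer_id is required; score_batch looks up the features.
predictions = fs.score_batch(f"runs:/{run.info.run_id}/model", inference_df)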
Question 5
B. R-squared: This is a standard regression metric measuring the proportion of variance in the target variable that is predictable from the features.
C. MAE: Mean Absolute Error is a common regression metric that measures the average magnitude of the errors in a set of predictions.
D. MSE: Mean Squared Error is a widely used regression metric that measures the average of the squared differences between predicted and actual values.
1. Databricks Official Documentation, "Regression and forecasting: model metrics": This document explicitly lists the metrics calculated for each run in a Databricks AutoML regression experiment. The primary metric is root mean squared error (RMSE), and other generated metrics include R-squared, MAE, and MSE. The F1 score is not mentioned for regression.
Source: Databricks Machine Learning Guide > AutoML > Reference > Regression and forecasting: model metrics.
2. Databricks Official Documentation, "Classification: model metrics": This document lists the metrics for AutoML classification experiments, which include F1 score, accuracy, log loss, precision, and recall. This confirms that F1 is a classification metric within the Databricks ecosystem.
Source: Databricks Machine Learning Guide > AutoML > Reference > Classification: model metrics.
3. Stanford University, CS229 Machine Learning Course Notes: In the course materials covering evaluation metrics, a clear distinction is made. Metrics for classification include accuracy, precision, recall, and F1 score. Metrics for regression include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
Source: Ng, A. (2023). CS229 Machine Learning Course Notes, Stanford University, "Part V: Learning Theory" and "Part VI: Evaluation Metrics".
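As a quick illustration of where these metrics come from, the sketch below launches an AutoML regression experiment from a notebook. The table name, target column, and timeout are assumptions, and spark is assumed to be the notebook's built-in SparkSession.

from databricks import automl

summary = automl.regress(
    dataset=spark.table("samples.nyctaxi.trips"),  # any Spark DataFrame or registered table
    target_col="fare_amount",
    primary_metric="rmse",      # regression trials are ranked by RMSE
    timeout_minutes=30,
)
# Each trial logged to the linked MLflow experiment records R-squared, MAE, and MSE
# alongside RMSE; F1 score appears only for classification experiments.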
Question 6
The code block is not accomplishing the task.
Which reason describes why the code block is not accomplishing the imputation task?
A. The question does not mention a test set. The code fails to impute any data, making the train/test distinction irrelevant to the core error.
B. Setting inputCols and outputCols to the same list of columns is a valid pattern in Spark ML used to overwrite the original columns with the imputed values.
C. The .fit() method does not replace the .transform() method. The correct workflow requires calling .fit() first to create a model, and then calling .transform() on that model.
1. Apache Spark Official Documentation, MLlib Guide, Feature Extractors, Transformers, and Selectors: "An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. It is implemented as a class that has a fit() method. The fit() method accepts a DataFrame and returns a Model, which is a Transformer."
Source: Apache Spark 3.5.0 Documentation, MLlib > ML Pipelines > Main Concepts in Pipelines.
2. Databricks Official Documentation, Impute missing values: The official example code demonstrates the correct two-step process. First, an Imputer is instantiated. Second, the .fit() method is called on the DataFrame to create a model. Third, the .transform() method is called on the resulting model.
Source: Databricks Documentation > Machine Learning > Feature engineering > Impute missing values. The example code clearly shows model = imputer.fit(df) followed by model.transform(df).show().
3. Apache Spark Official Documentation, pyspark.ml.feature.Imputer API: The API documentation specifies that Imputer is an Estimator with a fit() method that returns an ImputerModel. The ImputerModel is a Transformer that possesses the transform() method. This confirms the required sequence of operations.
Source: Apache Spark 3.5.0 Documentation, pyspark.ml.feature Module > Imputer class.
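A minimal, self-contained sketch of the fit-then-transform sequence the references describe (the toy data and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, None), (2.0, 4.0), (None, 6.0)], ["price", "units"])

imputer = Imputer(
    strategy="median",
    inputCols=["price", "units"],
    outputCols=["price_imputed", "units_imputed"],
)

# Imputer is an Estimator: fit() learns the medians and returns an ImputerModel
# (a Transformer), and only that model exposes transform().
model = imputer.fit(df)
model.transform(df).show()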
Question 7
The code block is returning an error.
Which of the following adjustments does the data scientist need to make to accomplish this task?
A. The method parameter for OneHotEncoder is deprecated since Spark 3.0 and is not the source of the error, which is related to the input data type.
B. The .fit() method is essential for an Estimator like OneHotEncoder. It learns the number of categories from the data to correctly build the model for transformation.
D. VectorAssembler is used to combine multiple feature columns into a single vector column, a step that is typically performed after one-hot encoding, not before.
1. Databricks Official Documentation, "One-hot encoder": "One-hot encoder maps a column of category indices to a column of binary vectors... The input columns must be of numeric type." This statement confirms that the input cannot be a string and must be a category index.
Source: Databricks Documentation > Docs > Machine Learning > Feature engineering > Feature transformers > One-hot encoder.
2. Apache Spark 3.4.1 MLlib Guide, "Feature Transformers - OneHotEncoder": "It is common to have StringIndexer compute the label indices and then OneHotEncoder to encode the indexed labels." This explicitly describes the required two-step process.
Source: Apache Spark Documentation > Spark 3.4.1 > MLlib > MLlib Programming Guide > Feature Extraction, Transformation and Selection > OneHotEncoder.
3. Databricks Official Documentation, "StringIndexer": The example code on this page demonstrates the standard pipeline where a StringIndexer is applied first, followed by a OneHotEncoder, confirming the correct sequence of operations.
Source: Databricks Documentation > Docs > Machine Learning > Feature engineering > Feature transformers > StringIndexer.
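A minimal sketch of the StringIndexer-then-OneHotEncoder sequence the references describe; the example column name and values are assumptions:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("red",), ("blue",), ("red",), ("green",)], ["color"])

indexer = StringIndexer(inputCol="color", outputCol="color_index")          # string -> numeric index
encoder = OneHotEncoder(inputCols=["color_index"], outputCols=["color_ohe"])

# Both stages are Estimators, so the pipeline must be fit before transforming.
pipeline_model = Pipeline(stages=[indexer, encoder]).fit(df)
pipeline_model.transform(df).show()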
Question 8
A. spark_df.describe(): This function returns a DataFrame with summary statistics (count, mean, stddev, min, max) but does not generate any visual plots or histograms.
B. dbutils.data(spark_df).summarize(): This is syntactically incorrect. The DataFrame should be passed as an argument to the summarize function, not chained in this manner.
C. This task cannot be accomplished in a single line of code: This is incorrect because dbutils.data.summarize(spark_df) accomplishes the task in one line.
D. spark_df.summary(): Similar to .describe(), this function returns a DataFrame with summary statistics but does not produce visual histograms.
1. Databricks Official Documentation, "Data utility (dbutils.data)": This document explicitly describes the summarize command. It states, "The summarize command performs a summary of the columns of a Spark DataFrame and returns the result in a structured and visual way." The page includes an example output showing the generated histograms.
Reference: Databricks Documentation > Reference > Databricks utilities (dbutils) > Data utility (dbutils.data).
2. Apache Spark Official Documentation, pyspark.sql.DataFrame.describe: This API documentation confirms that the describe() method "Computes basic statistics for numeric and string columns" and returns a new DataFrame with these statistics, with no mention of visualization capabilities.
Reference: Apache Spark 3.5.0 > PySpark API Reference > pyspark.sql > pyspark.sql.DataFrame.describe.
3. Apache Spark Official Documentation, pyspark.sql.DataFrame.summary: This API documentation details that the summary() method computes specified statistics for columns and returns them as a DataFrame. It is a more advanced version of describe() but still does not generate visual output.
Reference: Apache Spark 3.5.0 > PySpark API Reference > pyspark.sql > pyspark.sql.DataFrame.summary.
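For illustration, the one-liner and the two non-visual alternatives look like this in a Databricks notebook, assuming spark_df is an existing Spark DataFrame and dbutils is available (as it is by default in notebooks):

# One line, with histograms rendered in the notebook output:
dbutils.data.summarize(spark_df)

# For comparison, these return plain DataFrames of statistics and no plots:
spark_df.describe().show()
spark_df.summary().show()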
Question 9
A. fmin: This is the main optimization function that runs the tuning loop, but it requires a SparkTrials object to actually execute the trials in parallel.
C. quniform: This is a function used within the search space to define a quantized uniform distribution for a hyperparameter; it does not control execution.
D. search_space: This is a dictionary that defines the hyperparameters and their distributions to be tested; it is not a tool for parallelization.
E. objective_function: This user-defined function evaluates a single set of hyperparameters; it does not manage the parallel execution of multiple evaluations.
1. Databricks Official Documentation, "Parallelize hyperparameter tuning with scikit-learn and MLflow": This document explicitly states, "To parallelize tuning, use the SparkTrials class. SparkTrials takes one argument, parallelism... Hyperopt evaluates this many trials in parallel." It provides a clear code example showing fmin being used with a SparkTrials object. (Databricks Machine Learning Guide > Hyperparameter tuning > Scikit-learn, MLflow, and automated MLflow tracking > Parallelize hyperparameter tuning with scikit-learn and MLflow).
2. Databricks Official Documentation, "Hyperopt concepts": In the section describing the components of a Hyperopt workflow, it defines SparkTrials as the class to "Use to scale up tuning. SparkTrials distributes trials to Spark workers." (Databricks Machine Learning Guide > Hyperparameter tuning > Hyperopt concepts).
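A minimal sketch of parallelizing Hyperopt with SparkTrials, following the cited documentation; the objective function and search space here are illustrative stand-ins for a real training loop:

from hyperopt import fmin, tpe, hp, SparkTrials

def objective_function(params):
    # In practice this trains a model and returns a validation loss.
    return (params["x"] - 3) ** 2

search_space = {"x": hp.quniform("x", 0, 10, 1)}

spark_trials = SparkTrials(parallelism=4)   # trials are distributed to Spark workers

best = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=16,
    trials=spark_trials,
)
print(best)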
Question 10
A. 3: This represents only the number of cross-validation folds, not the total number of models trained across all hyperparameter combinations.
B. 5: This is the sum of the number of hyperparameter values (3 + 2), which is an incorrect calculation for a grid search.
C. 6: This correctly identifies the number of hyperparameter combinations (3 × 2) but omits the 3 models trained for each combination due to 3-fold cross-validation.
---
1. Apache Spark Official Documentation, pyspark.ml.tuning.CrossValidator: The documentation for Spark's CrossValidator describes the process: "For each paramMap, CrossValidator will split the dataset into k folds. Then it will train on k-1 folds and evaluate on the remaining fold." This confirms that for each hyperparameter combination (a paramMap), k models are trained (one for each fold). The parallelism parameter further confirms that these model fits can be executed in parallel. In this scenario, there are 6 paramMaps and k=3, resulting in 18 total model fits that can be parallelized.
Source: Apache Spark 3.5.0 Documentation, MLlib: Main Guide > ML Tuning: model selection and hyperparameter tuning > Cross-Validation.
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. In Chapter 7, Section 7.10.1 "Cross-Validation," the authors describe K-fold cross-validation. The process involves fitting the model K times on different subsets of the training data. When combined with a grid search, this fitting process is repeated for every point in the hyperparameter grid. The independence of each model fit makes the overall process highly parallelizable.
Source: Chapter 7, "Model Assessment and Selection," Section 7.10.1, page 242.
3. Databricks Machine Learning Documentation, "Hyperparameter tuning": The documentation explains how tools like Hyperopt with SparkTrials can "distribute runs and manage models" for hyperparameter tuning. This distribution of runs across a cluster's worker nodes is the mechanism that enables the parallel training of the multiple models generated by a grid search and cross-validation process.
Source: Databricks Documentation > Machine Learning > Models > Hyperparameter tuning.
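A minimal sketch of the 3 × 2 grid with 3-fold cross-validation discussed above, which produces 6 parameter maps × 3 folds = 18 model fits; the estimator and parameter choices are assumptions:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])     # 3 values
    .addGrid(lr.elasticNetParam, [0.0, 0.5])    # 2 values
    .build()                                    # 3 x 2 = 6 parameter maps
)

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,        # each parameter map is fit 3 times, so 18 fits in total
    parallelism=4,     # independent fits can run concurrently
)
# cv_model = cv.fit(train_df)   # train_df is an assumed DataFrame with features and label columns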
Question 11
B. One-hot encoding is an unsupervised transformation that depends only on the unique values within the feature itself, not on the target variable.
C. While creating many new columns can increase computational load, one-hot encoding is a standard preprocessing step applied to entire datasets, not just small samples.
D. One-hot encoding is one of the most common and standard strategies for numerically representing categorical features, particularly for linear models.
1. Databricks Official Documentation, "Feature Engineering in Unity Catalog": The documentation advocates for a "transform-on-read" approach. It states, "The logic that is used to compute features is managed in Unity Catalog and applied to input data before a model is trained or inference is performed." This supports the principle of storing features in a more raw state and applying model-specific transformations like one-hot encoding later in the ML pipeline. (See the section on "Feature engineering patterns on Databricks").
2. Stanford University, CS229 Machine Learning Course Notes: Lecture notes on "Advice for Applying Machine Learning" and "Supervised Learning" frequently discuss feature engineering. They explain that while linear models require numerical inputs (making one-hot encoding necessary for categorical data), decision trees can intrinsically handle discrete, unordered features. This highlights that the choice of encoding is dependent on the algorithm, justifying why a universal one-hot encoding in a feature store is problematic. (See discussions on feature selection and representation for different model classes).
3. Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. In Chapter 4, "Feature Engineering," the author discusses various categorical feature encoding techniques. The text explains the trade-offs, noting that one-hot encoding can lead to the "curse of dimensionality," which is problematic for certain algorithms. This reinforces the idea that encoding is a model-dependent choice, not a universal preprocessing step suitable for a central feature repository. (See Section: "Categorical Features > One-Hot Encoding").
Question 12
A. Exponentiating the final RMSE value is a mathematically incorrect operation that does not properly transform the error back to the original scale.
B. The model's predictions are already on the log scale, so taking the logarithm again would be an erroneous double transformation.
C. Evaluating the Mean Squared Error (MSE) instead of RMSE does not address the fundamental problem of the evaluation being performed on the incorrect (logarithmic) scale.
1. Databricks Documentation, PySpark API Reference for pyspark.ml.evaluation.RegressionEvaluator: The documentation specifies that the evaluator computes metrics based on an input DataFrame with predictionCol and labelCol. It operates directly on the data provided. This implies that if the data in these columns is on a transformed scale (like a log scale), the user is responsible for applying the appropriate inverse transformation to the columns before evaluation to get a metric on the original scale.
Source: Apache Spark 3.5.0 documentation (as used by Databricks), pyspark.ml.evaluation.RegressionEvaluator.
2. University Courseware on Statistical Modeling: In statistical modeling, when the response variable Y is transformed (e.g., log(Y)), the model produces predictions on the transformed scale. To obtain a prediction for Y in its original units, one must back-transform the prediction. For a log transformation, the back-transformation is exponentiation. This principle is fundamental for both prediction and the evaluation of prediction error.
Source: PennState, STAT 501: Regression Methods, Lesson 9: Transformations. The course notes explicitly state: "To transform a predicted value for log Y back to a predicted value for Y, we need to take the antilog of the predicted value." This directly supports applying an exponential function before calculating error metrics.
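A minimal sketch of the back-transformation the references describe, assuming predictions_df holds a log-scale prediction column and an original-scale label column named price:

from pyspark.sql import functions as F
from pyspark.ml.evaluation import RegressionEvaluator

# Back-transform the log-scale predictions (and, if the stored label is also on the
# log scale, back-transform it the same way) before evaluating, so the RMSE is
# reported in the original units.
rescaled = predictions_df.withColumn("prediction_orig", F.exp(F.col("prediction")))

evaluator = RegressionEvaluator(
    predictionCol="prediction_orig",
    labelCol="price",
    metricName="rmse",
)
rmse_original_scale = evaluator.evaluate(rescaled)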
Question 13
The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.
Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?
A. customer_id is a primary key and should not be imputed, as this would compromise data integrity.
C. units is a numerical column. While mode imputation is possible, mean or median imputation is generally preferred for numerical data.
D. spend is a continuous numerical column, for which mean or median imputation is the standard and more appropriate method.
E. customer_id is a unique identifier. Imputing it would create duplicate or meaningless keys.
1. Databricks Documentation, "Handle missing data with scikit-learn and pandas": This guide explicitly states the standard practice for different data types. In the section on imputation, it recommends: "For categorical features, you can impute missing values with the most frequent value (mode)." This directly supports using the mode for the loyalty_tier column.
Source: Databricks Official Documentation > Machine Learning > Data preparation > Handle missing data with scikit-learn and pandas.
2. Apache Spark Documentation, "Feature transformers - Imputer": The Imputer transformer, available in PySpark's ML library, is the tool used for this task on Databricks. It supports mean, median, and mode strategies. The documentation and common usage patterns confirm that mode is the designated strategy for categorical data, whereas mean and median are for numerical data.
Source: Apache Spark 3.5.0 Documentation > MLlib > pyspark.ml.feature.Imputer.
3. University of California, Berkeley, Courseware: In the "Data 100: Principles and Techniques of Data Science" course, the lecture on Data Cleaning outlines imputation methods. It specifies that for "Qualitative/Categorical" variables, a common strategy is to "Replace with the mode (most frequent category)." For "Quantitative" variables, it recommends replacing with the mean or median.
Source: UC Berkeley, Data 100, Spring 2024, Lecture 7: Data Cleaning, Slide 53 "Imputation Strategies".
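A minimal sketch of most-frequent-value (mode) imputation for a categorical column such as loyalty_tier, assuming features_df is the feature-set DataFrame from the question; numeric columns such as spend and units would instead use mean or median (for example with pyspark.ml.feature.Imputer):

from pyspark.sql import functions as F

# Find the most frequent non-null loyalty_tier value, then fill missing values with it.
mode_row = (
    features_df.filter(F.col("loyalty_tier").isNotNull())
    .groupBy("loyalty_tier")
    .count()
    .orderBy(F.desc("count"))
    .first()
)
features_df = features_df.fillna({"loyalty_tier": mode_row["loyalty_tier"]})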
Question 14
A. spark_df.loc[:,spark_df["discount"] <= 0]
This syntax uses the .loc indexer, which is a feature of the pandas library for label-based indexing and does not exist in the Spark DataFrame API.
B. spark_df[spark_df["discount"] <= 0]
This style of boolean mask filtering is the standard syntax for pandas DataFrames. While Spark has some similar bracket notation, this specific pattern is not the idiomatic way to filter rows.
D. spark_df.loc(spark_df["discount"] <= 0, :)
This syntax is invalid. It incorrectly attempts to use the pandas .loc indexer as a method, which is not an attribute or method of a Spark DataFrame.
---
1. Apache Spark Official Documentation: The pyspark.sql.DataFrame.filter documentation explicitly shows the correct usage. It states that the method "Filters rows using the given condition."
Source: Apache Spark 3.5.0 Documentation, pyspark.sql.DataFrame.filter.
Section: pyspark.sql.DataFrame.filter(condition)
Example provided: df.filter(df.age > 3).show() which is analogous to the correct answer's structure.
2. Databricks Official Documentation: The documentation on DataFrame transformations demonstrates filtering as a fundamental operation.
Source: Databricks Documentation, "Select, filter, sort, and aggregate data".
Section: "Filter rows"
The documentation provides the example df.filter(df.age > 21), confirming that .filter() with a column condition is the correct approach.
3. University Courseware (UC Berkeley): Course materials for data science programs frequently cover the distinction between pandas and Spark APIs.
Source: UC Berkeley, Data 100, "A Gentle Introduction to PySpark".
Section: "Filtering Data"
The guide illustrates filtering with trip.filter(trip['Duration'] > 1000), which is functionally identical to using the col() function and reinforces that .filter() is the correct transformation.
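A minimal sketch of the idiomatic filter, assuming spark_df has a discount column:

from pyspark.sql.functions import col

discounted = spark_df.filter(col("discount") <= 0)
# Equivalent forms: spark_df.filter(spark_df["discount"] <= 0) or spark_df.where(col("discount") <= 0)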
Question 15
B. Change the number of compute nodes and the number of evaluations to be much larger but equal.
This maintains the fully parallelized setup, which is the root cause of the problem, and would not introduce any sequential learning.
C. Change the iterative optimization algorithm used to facilitate the tuning process.
The problem is not the specific algorithm but its execution mode. Any iterative algorithm would fail to show sequential improvement under full parallelization.
D. Change the number of compute nodes to be double or more than double the number of evaluations.
This is inefficient as extra nodes would be idle, and it does not solve the problem since all eight evaluations would still run in parallel.
1. Databricks Official Documentation, "Parallelize hyperparameter tuning with Hyperopt": In the section "How Hyperopt works with SparkTrials", the documentation explains the parallelism parameter. It states, "When parallelism is greater than 1, TPE uses a random search strategy to select the first parallelism sets of hyperparameters, and then it uses the results of these initial sets to seed the TPE algorithm for subsequent sets of hyperparameters." This confirms that if parallelism equals the total number of evaluations, the process never moves beyond the initial random search phase, preventing iterative improvement. Reducing parallelism allows for subsequent, informed trials.
2. Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems 24 (NIPS 2011): This paper introduces the TPE algorithm. It describes TPE as a sequential process where the choice of the next hyperparameter configuration is informed by observations of previous configurations. This sequential nature is fundamentally disrupted by a fully parallel execution where no "previous" observations are available for the initial (and only) batch of trials. The paper's description of the sequential model-based optimization process underpins why reducing parallelism (Option A) is necessary to leverage the algorithm's strength. (Available via NeurIPS Proceedings).
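A minimal sketch of the fix the cited documentation implies: keep the number of evaluations larger than the parallelism so later trials can be seeded by earlier results. The objective function and search space are illustrative stand-ins for a real training loop:

from hyperopt import fmin, tpe, hp, SparkTrials

def objective_function(params):
    # In practice this trains and scores a model.
    return (params["max_depth"] - 6) ** 2

search_space = {"max_depth": hp.quniform("max_depth", 2, 12, 1)}

# With 8 total evaluations, a parallelism of 2 leaves 6 trials that TPE can choose
# based on the results already returned by earlier batches.
trials = SparkTrials(parallelism=2)
best = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=8,
    trials=trials,
)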