1. Databricks Official Documentation
"ML end-to-end example": This tutorial demonstrates the correct workflow. A Pipeline containing feature engineering stages is created. The documentation shows the pipeline being fitted only on the training data (pipelinemodel = pipeline.fit(traindf))
and then this fitted model is used to transform the test data (preddf = pipelinemodel.transform(testdf)). This directly supports the methodology described in the correct answer. (See: Databricks Documentation -> Machine Learning -> Tutorials -> ML end-to-end example -> Section: "Create a machine learning model").
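The pattern can be sketched in a few lines of PySpark. The data, column names, and stages below are illustrative placeholders, not code from the Databricks tutorial:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.getOrCreate()

    # Toy data standing in for the tutorial's train/test DataFrames.
    trainDF = spark.createDataFrame(
        [(1.0, 10.0, 0), (2.0, 20.0, 1), (3.0, 30.0, 0), (4.0, 40.0, 1)],
        ["x1", "x2", "label"])
    testDF = spark.createDataFrame([(5.0, 50.0, 1)], ["x1", "x2", "label"])

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["x1", "x2"], outputCol="rawFeatures"),
        StandardScaler(inputCol="rawFeatures", outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label")])

    # fit() sees only the training rows; the scaler's statistics and the
    # model's coefficients are learned from trainDF alone.
    pipelineModel = pipeline.fit(trainDF)

    # The already-fitted model is then applied, unchanged, to the held-out rows.
    predDF = pipelineModel.transform(testDF)
    predDF.select("features", "prediction").show()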
2. Apache Spark Official Documentation
"ML Pipelines": The documentation explains that when a Pipeline.fit() method is called
it calls fit() on each Estimator stage (like StandardScaler) in sequence. The resulting Transformer becomes part of the PipelineModel. This PipelineModel can then be used to transform() new data. This confirms that the statistics for scaling are learned only from the data passed to fit()
which should be the training set. (See: Apache Spark 3.5.0 Documentation -> MLlib: Machine Learning Library -> ML Pipelines -> Section: "How it works").
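This mechanism can be made concrete with a small, self-contained sketch (hypothetical data and column names, not taken from the Spark docs). Note how the StandardScaler Estimator is replaced inside the PipelineModel by the fitted StandardScalerModel Transformer it produced:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler, StandardScalerModel

    spark = SparkSession.builder.getOrCreate()
    trainDF = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["x"])
    testDF = spark.createDataFrame([(100.0,)], ["x"])

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["x"], outputCol="v"),    # already a Transformer
        StandardScaler(inputCol="v", outputCol="scaled")])  # an Estimator

    model = pipeline.fit(trainDF)  # StandardScaler.fit() runs here, on trainDF only

    # The Estimator stage is now the Transformer it produced during fit().
    scaler = model.stages[1]
    assert isinstance(scaler, StandardScalerModel)
    print(scaler.std)  # standard deviation learned from trainDF; testDF never seen

    # transform() on new data reuses those frozen statistics; nothing is re-fitted.
    model.transform(testDF).show()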
3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. In Chapter 7, Section 7.10.2, "The Wrong and Right Way to Do Cross-validation," the authors explicitly warn against this error. They state, "The test data should be strictly held out from all aspects of the model fitting, including feature scaling. Any preprocessing steps that use data, such as computing means and variances for standardization, must be learned from the training data only and then applied to the test data." This foundational principle of machine learning directly applies to a train-test split.
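A tiny NumPy illustration of the principle, with made-up numbers (not from the book): the mean and standard deviation are computed from the training values alone and then reused, unchanged, to scale the test values:

    import numpy as np

    train = np.array([1.0, 2.0, 3.0, 4.0])
    test = np.array([10.0, 20.0])

    # Learn the standardization parameters from the training data only.
    mu = train.mean()    # 2.5
    sigma = train.std()  # ~1.118

    # Apply those same parameters to both splits; never recompute them on test.
    train_scaled = (train - mu) / sigma
    test_scaled = (test - mu) / sigma

    print(train_scaled)  # zero mean, unit variance by construction
    print(test_scaled)   # e.g. (10 - 2.5) / 1.118 ~ 6.71: far from the training range

Recomputing mu and sigma on the test set would instead force the test values to look "typical," silently leaking information about the test distribution into the evaluation.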