Question 13 - Databricks Machine Learning Associate Real Exam Questions [Feb 2026 Update]

Q: 13

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model. Which of the following possible explanations for this difference is invalid?

Options

Correct Answer: E

Explanation

The question asks which explanation for the observed difference in Root Mean Squared Error (RMSE) between the two models is invalid. RMSE is a standard, widely used metric for evaluating regression models. It is the square root of the average squared difference between predicted and actual values, and it estimates the error magnitude in the units of the label. The statement that RMSE is an invalid evaluation metric for regression problems (option E) is therefore factually incorrect, which makes it the invalid explanation. The other options describe plausible scenarios, including a common error (option B) in which predictions on a transformed scale are not converted back to the original scale before evaluation.
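As a minimal sketch of the definition above, RMSE can be computed in plain Python as follows. The function name `rmse` and the sample prices are illustrative assumptions and do not come from the exam question.

```python
import math

def rmse(predictions, actuals):
    """Square root of the mean squared difference between predictions and actuals."""
    if len(predictions) != len(actuals) or not actuals:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices in dollars; the resulting RMSE is also in dollars.
actual_prices = [100.0, 200.0, 300.0]
predicted_prices = [110.0, 190.0, 310.0]
print(rmse(predicted_prices, actual_prices))  # each error is 10, so RMSE is 10.0
```

Because the result is in the units of the label, it is only meaningful when predictions and actual values are on the same scale.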

Why Incorrect

A. This is a plausible scenario. The second model could be more accurate, but the RMSE was miscalculated (e.g., as described in B), leading to a misleadingly high value.

B. This is a very likely and valid explanation. Failing to apply the exponential function to the second model's log-scale predictions before comparing them to the original price would result in a massive, meaningless RMSE.

C. This describes an irrelevant action. There is no reason to take the log of the first model's predictions, so this does not explain the observed difference. However, it is not a fundamentally false statement about ML principles like option E.

D. This is a simple and valid possibility. The first model, which directly models price, might genuinely be a better fit for the data than the log-transformed model.
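The scale mismatch described in option B can be illustrated with a short Python sketch. All values here are hypothetical: the "model" is simulated by assuming each log-scale prediction corresponds to a price 5% above the true price, and `rmse` is a helper defined for this example only.

```python
import math

def rmse(predictions, actuals):
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual prices on the original scale.
actual_prices = [100.0, 200.0, 400.0]

# Simulated output of a model trained on log(price): each prediction is the
# log of a price 5% above the true value (an assumption for illustration).
log_predictions = [math.log(p * 1.05) for p in actual_prices]

# Mistake: comparing log-scale predictions directly to prices.
rmse_wrong = rmse(log_predictions, actual_prices)

# Fix: exponentiate the predictions back to the price scale first.
price_predictions = [math.exp(lp) for lp in log_predictions]
rmse_correct = rmse(price_predictions, actual_prices)

print(f"without exp: {rmse_wrong:.2f}")
print(f"with exp:    {rmse_correct:.2f}")
```

In this sketch the RMSE computed without exponentiating is many times larger than the corrected value, even though the underlying predictions are identical.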

References

1. Databricks Official Documentation

pyspark.ml.evaluation.RegressionEvaluator: The official API documentation for PySpark's RegressionEvaluator class lists rmse (Root Mean Squared Error) as the default metric for evaluation. This confirms its validity and standard use within the Databricks ecosystem.

Source: Apache Spark 3.5.0 Documentation > PySpark > pyspark.ml > pyspark.ml.evaluation. Section: metricName.

2. Databricks Official Documentation

"Regression: predict house prices" Tutorial: This official Databricks tutorial builds a regression model and explicitly uses RMSE to evaluate its performance. The "Evaluate the model" section states: "First, we'll look at the root mean squared error (RMSE). This metric is the square root of the mean squared error. It is a common metric to evaluate regression models."

Source: Databricks Machine Learning Guide > Tutorials > Regression: predict house prices.

3. University Courseware

"An Introduction to Statistical Learning": This is a standard textbook in many university statistics and machine learning programs. In Chapter 2, "Statistical Learning," Section 2.2.1, "Measuring the Quality of Fit," the Mean Squared Error (MSE) is introduced as the primary method for assessing the accuracy of a regression model. RMSE is the square root of MSE and is used for the same purpose, often preferred because its units are the same as the response variable.

Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 30).

4. University Courseware

"An Introduction to Statistical Learning" on Transformations: In Chapter 3, "Linear Regression," Section 3.3, "Other Considerations in the Regression Model," the text discusses addressing non-linearity by transforming variables. A common approach is replacing the response Y with log(Y). Predictions from such a model are on the log scale, so they must be transformed back to the original scale for interpretation and evaluation, which supports why option B is a valid potential explanation for the error.

Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 93).
