1. Databricks Official Documentation
pyspark.ml.evaluation.RegressionEvaluator: The official API documentation for PySpark's RegressionEvaluator class lists rmse (Root Mean Squared Error) as the default metric for evaluation. This confirms its validity and standard use within the Databricks ecosystem.
Source: Apache Spark 3.5.0 Documentation > PySpark > pyspark.ml > pyspark.ml.evaluation. Section: metricName.
2. Databricks Official Documentation
"Regression: predict house prices" Tutorial: This official Databricks tutorial builds a regression model and explicitly uses RMSE to evaluate its performance. The "Evaluate the model" section states
"First
we'll look at the root mean squared error (RMSE). This metric is the square root of the mean squared error. It is a common metric to evaluate regression models."
Source: Databricks Machine Learning Guide > Tutorials > Regression: predict house prices.
3. University Courseware
"An Introduction to Statistical Learning": This is a standard textbook in many university statistics and machine learning programs. In Chapter 2
"Statistical Learning
" Section 2.1.5
"Measuring the Quality of Fit
" the Mean Squared Error (MSE) is introduced as the primary method for assessing the accuracy of a regression model. RMSE is the square root of MSE and is used for the same purpose
often preferred because its units are the same as the response variable.
Source: James
G.
Witten
D.
Hastie
T.
& Tibshirani
R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 30).
4. University Courseware
"An Introduction to Statistical Learning" on Transformations: In Chapter 3
"Linear Regression
" Section 3.3.3
"Other Considerations in the Regression Model
" the text discusses addressing non-linearity by transforming variables. A common approach is replacing the response Y with log(Y). When making predictions with such a model
one must remember to transform the prediction back to the original scale for interpretation and evaluation
which supports why option (B) is a valid potential explanation for error.
Source: James
G.
Witten
D.
Hastie
T.
& Tibshirani
R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. (Page 93).