1. Apache Spark Official Documentation
pyspark.ml.tuning.CrossValidator: The documentation for Spark's CrossValidator describes the process: "For each paramMap, CrossValidator will split the dataset into k folds. Then it will train on k-1 folds and evaluate on the remaining fold." This confirms that for each hyperparameter combination (a paramMap), k models are trained, one per fold. The parallelism parameter further confirms that these model fits can be executed in parallel. In this scenario there are 6 paramMaps and k = 3, so 6 x 3 = 18 model fits in total, all of which can be run in parallel (see the sketch after this entry).
Source: Apache Spark 3.5.0 Documentation, MLlib: Main Guide > ML Tuning: model selection and hyperparameter tuning > Cross-Validation.
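To make the arithmetic concrete, the following is a minimal PySpark sketch of the configuration described above. The estimator, column names, and hyperparameter values are illustrative assumptions, not taken from the cited documentation; only CrossValidator, ParamGridBuilder, numFolds, and parallelism are the documented pieces.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

# 2 values of regParam x 3 values of elasticNetParam = 6 paramMaps
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,      # k = 3: each paramMap is fit once per fold
    parallelism=4,   # how many of those fits may run concurrently
)

# cv_model = cv.fit(train_df)  # 6 paramMaps x 3 folds = 18 model fits in total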
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. In Chapter 7, Section 7.10.1, "Cross-Validation," the authors describe K-fold cross-validation: the model is fit K times, each time on a different subset of the training data. When combined with a grid search, this fitting is repeated for every point in the hyperparameter grid. Because each of these fits is independent of the others, the overall process is highly parallelizable (illustrated in the sketch after this entry).
Source: Chapter 7, "Model Assessment and Selection," Section 7.10.1, page 242.
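To illustrate the independence argument, here is a small plain-Python sketch that enumerates the work produced by crossing a hyperparameter grid with K folds. The parameter names and grid values are assumptions chosen only to mirror the 6 paramMaps and k = 3 of the scenario above.

from itertools import product

# 2 x 3 = 6 hypothetical hyperparameter points
param_grid = [{"regParam": r, "elasticNetParam": e}
              for r, e in product([0.01, 0.1], [0.0, 0.5, 1.0])]
folds = range(3)  # K = 3

# Every (params, fold) pair is one model fit. No task reads another task's
# output, so all of them can be dispatched to workers at the same time.
tasks = list(product(param_grid, folds))
print(len(tasks))  # 18 independent model fits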
3. Databricks Machine Learning Documentation
"Hyperparameter tuning": The documentation explains how tools like Hyperopt with SparkTrials can "distribute runs and manage models" for hyperparameter tuning. This distribution of runs across a cluster's worker nodes is the mechanism that enables the parallel training of the multiple models generated by a grid search and cross-validation process.
Source: Databricks Documentation > Machine Learning > Models > Hyperparameter tuning.
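The following is a hedged sketch of distributed tuning with Hyperopt's SparkTrials, in the spirit of the documentation cited above. The objective function and search space are placeholders, and the snippet assumes Hyperopt and PySpark are installed and a Spark session is available.

from hyperopt import SparkTrials, fmin, hp, tpe

def objective(params):
    # Placeholder: train a model with `params` and return a loss to minimize.
    return (params["regParam"] - 0.05) ** 2

search_space = {"regParam": hp.uniform("regParam", 0.0, 1.0)}

# SparkTrials ships each trial to a Spark worker; parallelism bounds how many
# trials run concurrently.
spark_trials = SparkTrials(parallelism=4)
best = fmin(fn=objective,
            space=search_space,
            algo=tpe.suggest,
            max_evals=18,
            trials=spark_trials)
print(best)  # best hyperparameters found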