1. Databricks Official Documentation, "Feature engineering and featurization": This document states, "You can use Spark ML to scale feature engineering... Spark ML provides a broad set of feature transformers that you can apply to your data." It then lists numerous built-in transformers, such as StandardScaler and VectorAssembler, that operate on Spark DataFrames. (See section: "Feature engineering with MLlib").
2. Databricks Official Documentation, "Single-node and distributed training": This page contrasts single-node libraries with distributed ones. It clarifies, "For big data, you can use Spark’s machine learning library, MLlib... MLlib is a distributed library and is the only ML library that comes pre-installed on Databricks Runtimes." This highlights that Spark ML is the native distributed tool, unlike Scikit-learn. (See section: "Distributed training").
3. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc. Chapter 6, "Machine Learning with MLlib," describes the library's architecture, stating, "MLlib is Spark’s library of machine learning algorithms... The feature extraction and transformation utilities in MLlib help you create features from your raw data." The chapter details transformers that are inherently distributed. (See Chapter 6, "Feature Extraction and Transformation").