1. Official Vendor Documentation (Databricks): The Databricks documentation on "Feature engineering in Unity Catalog" emphasizes the principle of reusability. It states, "The feature store provides a centralized repository that enables discovery and reuse of features." Pre-applying a model-specific transformation like OHE would limit this reusability, as not all models benefit from it. This aligns with the principle of keeping features in a more general state within the store.
Source: Databricks Documentation, "Feature engineering in Unity Catalog > What is a feature store?"
2. Academic Publication: In the textbook The Elements of Statistical Learning, the authors explain how different model families handle categorical predictors. Linear models require dummy variables (one-hot encoding), while tree-based models can handle them natively. This highlights that the encoding strategy is algorithm-dependent.
Source: Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. Chapter 9, Section 9.2.4, "Other Issues," discusses how tree-based models can handle categorical predictors without creating dummy variables.
3. University Courseware: Stanford's course on Machine Learning Systems Design discusses the architecture of feature stores. A key principle is to decouple feature generation from model-specific transformations. The feature store should provide consistent, raw (or lightly processed) features, while model-specific steps like OHE (whose parameters depend on the training data's vocabulary) should be part of the model training pipeline to ensure consistency and appropriateness for the chosen algorithm.
Source: Stanford University, CS 329S: Machine Learning Systems Design, Winter 2021, Lecture 4: "Data Engineering."
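The separation described in these references can be illustrated with a minimal sketch. It is not taken from any of the cited sources and does not use the Databricks feature store API. The row data, the column name "plan", and the helpers fit_vocabulary and one_hot are all hypothetical. The sketch assumes the feature table stores the categorical value as a raw string, and that the training pipeline learns the one-hot vocabulary from the training rows only and then reuses that same vocabulary at serving time.

```python
# Hypothetical raw feature rows, as they might be read from a feature table.
# The categorical column "plan" is stored unencoded so any model can reuse it.
train_rows = [
    {"customer_id": 1, "plan": "basic"},
    {"customer_id": 2, "plan": "premium"},
    {"customer_id": 3, "plan": "basic"},
]
serving_rows = [
    {"customer_id": 4, "plan": "premium"},
    {"customer_id": 5, "plan": "enterprise"},  # category not seen during training
]


def fit_vocabulary(rows, column):
    """Learn the sorted list of distinct categories from the training rows only."""
    return sorted({row[column] for row in rows})


def one_hot(rows, column, vocabulary):
    """Encode a column against a fixed vocabulary.

    Values missing from the vocabulary are encoded as all zeros (an
    assumption made for this sketch; other policies are possible).
    """
    return [
        [1 if row[column] == category else 0 for category in vocabulary]
        for row in rows
    ]


# The vocabulary is a parameter of the model pipeline, not of the feature table.
vocabulary = fit_vocabulary(train_rows, "plan")
train_encoded = one_hot(train_rows, "plan", vocabulary)
serving_encoded = one_hot(serving_rows, "plan", vocabulary)

print(vocabulary)
print(train_encoded)
print(serving_encoded)
```

Because the vocabulary is fitted inside the training pipeline, a tree-based model could read the same raw "plan" column and skip the encoding step, while a linear model would apply it with the vocabulary saved alongside the model.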