Question 1 - Databricks Machine Learning Associate Real Exam Questions [Feb 2026 Update]

Q: 1

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion?

Options

Correct Answer:

Explanation

A feature repository, or feature store, is designed to create and manage features for reuse across multiple machine learning models and teams. The right encoding for a categorical feature depends heavily on the machine learning algorithm that consumes it. One-hot encoding (OHE) creates a high-dimensional, sparse representation of categorical data. Linear models such as linear regression need a numeric encoding of this kind, but it can be suboptimal or even detrimental for others, such as tree-based models (e.g., Random Forest, Gradient Boosting), many implementations of which can split on categorical features directly. Storing the raw categorical feature in the repository provides maximum flexibility, allowing each downstream modeling pipeline to apply the most appropriate encoding strategy for its specific algorithm.
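As a minimal sketch of this idea, the snippet below keeps one raw categorical column and derives two different encodings from it, one per downstream model family. The column name `plan`, the sample values, and both helper functions are hypothetical and written in plain Python for illustration; in practice a library encoder (for example scikit-learn's `OneHotEncoder` and `OrdinalEncoder`, or Spark MLlib's `StringIndexer` and `OneHotEncoder`) would normally take their place.

```python
# Raw categorical values as they might be stored in a feature table.
# The column name "plan" and the values are made up for illustration.
raw_plan = ["basic", "pro", "basic", "enterprise", "pro"]

# Vocabulary derived from the data, kept in a fixed (sorted) order.
vocab = sorted(set(raw_plan))


def one_hot(values, vocab):
    """One indicator column per category, e.g. for a linear model."""
    return [[1 if v == cat else 0 for cat in vocab] for v in values]


def ordinal_index(values, vocab):
    """A single integer column, e.g. for a tree learner that accepts category indices."""
    index = {cat: i for i, cat in enumerate(vocab)}
    return [index[v] for v in values]


# Two pipelines, one stored feature: each applies the encoding it needs.
linear_features = one_hot(raw_plan, vocab)      # one row per value, one column per category
tree_features = ordinal_index(raw_plan, vocab)  # one integer per value
```

Had the store held only the one-hot columns, the second pipeline would first have to reverse the encoding to recover the original category.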

Why Incorrect

A. This is factually incorrect. One-hot encoding is a standard, widely supported preprocessing step in major machine learning libraries, including scikit-learn and Spark MLlib.

B. This describes target (or mean) encoding. One-hot encoding is an unsupervised transformation that depends only on the values within the feature itself, not the target variable.

C. While OHE can be computationally intensive for high-cardinality features, the primary justification for avoiding it in a feature store is its model-specific nature, not just its computational cost.

D. This is factually incorrect. One-hot encoding is one of the most common and fundamental strategies for numerically representing categorical features for many model types.
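The contrast with target encoding noted for option B can be sketched as follows. The data and both helper functions are hypothetical; the point is only that `one_hot` never looks at the target, whereas `target_encode` cannot be computed without it.

```python
from collections import defaultdict

# Hypothetical toy data: one categorical feature and a binary target.
city = ["NY", "SF", "NY", "SF", "NY"]
churned = [1, 0, 1, 1, 0]


def one_hot(values):
    """Uses only the feature values; the target is never consulted."""
    vocab = sorted(set(values))
    return [[1 if v == cat else 0 for cat in vocab] for v in values]


def target_encode(values, target):
    """Replaces each category with the mean target observed for that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, y in zip(values, target):
        sums[v] += y
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]


ohe = one_hot(city)                # stays the same if `churned` changes
te = target_encode(city, churned)  # depends directly on `churned`
```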

References

1. Official Vendor Documentation (Databricks): The Databricks documentation on "Feature engineering in Unity Catalog" emphasizes the principle of reusability. It states, "The feature store provides a centralized repository that enables discovery and reuse of features." Pre-applying a model-specific transformation like OHE would limit this reusability, as not all models benefit from it. This aligns with the principle of keeping features in a more general state within the store.

Source: Databricks Documentation, "Feature engineering in Unity Catalog > What is a feature store?".

2. Academic Publication: In the textbook The Elements of Statistical Learning, the authors explain how different model families handle categorical predictors. Linear models require dummy variables (one-hot encoding), while tree-based models can handle them natively. This highlights that the encoding strategy is algorithm-dependent.

Source: Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. Chapter 9, Section 9.2.4, "Other Issues," discusses how tree-based models can handle categorical predictors without creating dummy variables.

3. University Courseware: Stanford's course on Machine Learning Systems Design discusses the architecture of feature stores. A key principle is to decouple feature generation from model-specific transformations. The feature store should provide consistent, raw (or lightly processed) features, while model-specific steps like OHE (whose parameters depend on the training data's vocabulary) should be part of the model training pipeline to ensure consistency and appropriateness for the chosen algorithm.

Source: Stanford University, CS 329S: Machine Learning Systems Design, Winter 2021, Lecture 4: "Data Engineering."
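The point that OHE parameters depend on the training data's vocabulary can be illustrated with a small sketch. The class `OneHotVocabulary`, its method names, and the sample values are all hypothetical; mapping unseen categories to an all-zero row is one possible policy chosen here for illustration, not the only one.

```python
class OneHotVocabulary:
    """Hypothetical helper that learns its categories from training data only."""

    def fit(self, values):
        # The learned vocabulary is a parameter of this particular training run.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # A category not seen during fit matches no column and yields an all-zero row.
        return [[1 if v == c else 0 for c in self.categories_] for v in values]


train_color = ["red", "green", "red", "blue"]  # made-up training column
serve_color = ["green", "purple"]              # "purple" never appeared in training

encoder = OneHotVocabulary().fit(train_color)
encoded = encoder.transform(serve_color)
```

Because the fitted vocabulary belongs to one training run, keeping the encoder inside the model's pipeline lets training and serving share the same column layout, while the feature store continues to hold the raw values.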
