1. Official Apache Spark Documentation: The documentation for LinearRegression in MLlib explicitly details the optimization algorithms used. It states: "The implementation is based on the MLlib LBFGS optimizer for L2-regularized linear regression... For unregularized linear regression, the implementation uses a wrapper for the NormalEquation and Cholesky solvers... The normal equation solver is limited to at most 4096 features." This confirms that for a large number of variables (beyond 4096), the iterative L-BFGS optimizer is the method employed; the snippet below shows how the solver can be selected explicitly.
Source: Apache Spark 3.5.0 MLlib Guide, Classification and regression > Linear methods > Linear regression > Mathematical detail.
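For concreteness, the DataFrame-based API exposes this choice through LinearRegression's setSolver parameter (valid values are "l-bfgs", "normal", and "auto"). Below is a minimal sketch; the local SparkSession, toy three-feature dataset, and hyperparameter values are illustrative assumptions, not values from the documentation:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object SolverSelectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SolverSelectionSketch").master("local[*]").getOrCreate()

    // Toy (label, features) data; the 4096-feature cutoff only matters at scale.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (3.0, Vectors.dense(2.0, 1.3, 1.0))
    )).toDF("label", "features")

    val lr = new LinearRegression()
      .setRegParam(0.1)          // L2 strength (illustrative value)
      .setElasticNetParam(0.0)   // 0.0 selects a pure L2 penalty
      .setSolver("l-bfgs")       // force the iterative solver explicitly
    val model = lr.fit(training)

    println(s"Coefficients: ${model.coefficients}")
    spark.stop()
  }
}
```

With the default solver ("auto"), Spark falls back to L-BFGS whenever the normal equation solver is unavailable, which per the quoted passage includes any problem with more than 4096 features.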
2. Academic Publication: The foundational paper on Spark's machine learning library highlights the design choice of iterative methods for scalability: "For many ML algorithms, we can express the optimization problem as a sum of loss terms... and solve it with gradient descent. The gradient can be computed on a cluster by summing gradients computed on subsets of the data in parallel... MLlib has a general-purpose gradient descent optimizer." L-BFGS is a more advanced quasi-Newton iterative optimization method built on the same principles; a sketch of this parallel gradient summation follows the citation below.
Source: Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Talwalkar, A. (2016). MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research, 17(34), 1-7. (Section 3.1: Implementation).
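To make the quoted decomposition concrete, here is a minimal sketch of one gradient descent loop for squared loss, in the spirit of the paper rather than MLlib's actual optimizer code. treeAggregate and broadcast are real Spark RDD APIs; the dataset, step size, and iteration count are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object ParallelGradientSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelGradientSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy (label, features) pairs; illustrative only.
    val data = sc.parallelize(Seq(
      (1.0, Array(0.0, 1.1, 0.1)),
      (0.0, Array(2.0, 1.0, -1.0)),
      (3.0, Array(2.0, 1.3, 1.0))
    )).cache()

    val n = data.count().toDouble
    val stepSize = 0.1
    var w = Array.fill(3)(0.0)

    for (_ <- 1 to 50) {
      val wB = sc.broadcast(w)
      // Each task accumulates squared-loss gradients (w.x - y) * x on its
      // slice of the data; treeAggregate then sums the partial results
      // across the cluster: the "sum of loss terms" the paper describes.
      val grad = data.treeAggregate(Array.fill(3)(0.0))(
        (acc: Array[Double], point: (Double, Array[Double])) => {
          val (y, x) = point
          val err = wB.value.zip(x).map { case (wi, xi) => wi * xi }.sum - y
          for (i <- acc.indices) acc(i) += err * x(i)
          acc
        },
        (a: Array[Double], b: Array[Double]) => {
          for (i <- a.indices) a(i) += b(i)
          a
        }
      )
      w = w.zip(grad).map { case (wi, gi) => wi - stepSize * gi / n }
      wB.destroy()
    }
    println(s"Learned weights: ${w.mkString(", ")}")
    spark.stop()
  }
}
```

Each iteration scans the data once and ships back only a single gradient vector per partition, which is the property that lets this pattern scale to large clusters.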
3. Official Databricks Documentation: The Databricks documentation on Linear Regression likewise confirms the use of iterative optimization: "The training algorithm uses the L-BFGS optimizer." This directly identifies an iterative optimizer as the standard in their implementation; the sketch after this item calls that optimizer directly through the spark.mllib API.
Source: Databricks Documentation, Machine Learning Guide > MLflow > MLflow models > spark.mllib > Linear Regression.
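The RDD-based spark.mllib API also exposes this optimizer directly as LBFGS.runLBFGS, paired here with a least-squares gradient and an L2 updater (LeastSquaresGradient and SquaredL2Updater are real spark.mllib classes); the toy data and hyperparameter values are illustrative assumptions:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SquaredL2Updater}
import org.apache.spark.sql.SparkSession

object LbfgsDirectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LbfgsDirectSketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // runLBFGS expects an RDD of (label, features) pairs.
    val data = sc.parallelize(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (3.0, Vectors.dense(2.0, 1.3, 1.0))
    )).cache()

    val (weights, lossHistory) = LBFGS.runLBFGS(
      data,
      new LeastSquaresGradient(),  // squared-error loss, i.e. linear regression
      new SquaredL2Updater(),      // L2 regularization
      10,                          // numCorrections: history size of the quasi-Newton approximation
      1e-6,                        // convergence tolerance
      100,                         // max iterations
      0.1,                         // regParam (illustrative)
      Vectors.dense(0.0, 0.0, 0.0) // initial weights
    )
    println(s"Weights: $weights, final loss: ${lossHistory.last}")
    spark.stop()
  }
}
```

The numCorrections parameter is what makes the method "limited-memory": the quasi-Newton curvature estimate is rebuilt from only the last few gradient differences, so per-iteration memory stays linear in the number of features, which is why the solver scales past the normal equation's 4096-feature limit.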