1. Databricks Documentation
"Pandas API on Spark": "The pandas API on Spark allows you to scale your pandas workload to any size by running it distributed on a Spark cluster. If you are already familiar with pandas, you can be immediately productive with the pandas API on Spark...". This supports the combination of scalability (distributed execution on a cluster) and the use of a familiar, feature-rich API.
2. Apache Spark™ 3.5.1 Documentation
"pyspark.pandas": "This project makes data scientists more productive when interacting with big data
by implementing the pandas DataFrame API on top of Apache Spark." This directly states the goal is to provide the pandas API on top of Spark for big data
which aligns with executing queries across a cluster using familiar features.
3. Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc. (Conceptual basis): Chapter 1 discusses the limitations of single-machine processing (such as standard pandas) and introduces Spark's model of distributed computation across a cluster as the solution for big data. The Pandas API on Spark is a direct application of this principle, providing a high-level API over the distributed engine.
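The API-compatibility claim in the sources above can be illustrated with a minimal sketch. The code below uses plain pandas with hypothetical data; on a Spark cluster, the same DataFrame operations would run distributed by changing only the import to `import pyspark.pandas as pd`, per the Databricks and Apache Spark documentation cited.

```python
# Minimal sketch of the shared DataFrame API (hypothetical data).
# On a Spark cluster, replace this import with:
#   import pyspark.pandas as pd
# and the rest of the code is unchanged but executes distributed.
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen"],
    "sales": [10, 20, 5],
})

# Identical groupby/aggregation syntax in pandas and pyspark.pandas.
totals = df.groupby("city")["sales"].sum()
print(totals.to_dict())  # {'Bergen': 5, 'Oslo': 30}
```

This single-import swap is the practical payoff of implementing the pandas DataFrame API on top of Spark: existing pandas code scales to cluster-sized data without a rewrite.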