1. Databricks Documentation
"Pandas API on Spark": In the section on interoperability with pandas, the documentation warns about this specific behavior. It states: "These conversions require collecting all the data into the driver’s memory. Therefore, you should be careful and only do this on a small subset of data."
Source: Databricks Documentation > Apache Spark > Development > Pandas API on Spark > Working with pandas.
2. Apache Spark™ 3.5.1 Documentation
pyspark.pandas.DataFrame.to_pandas: The official API documentation for the .to_pandas() method includes a warning note. It explicitly states: "This method should only be used if the resulting pandas DataFrame is expected to be small, as all the data is loaded into the driver’s memory."
Source: Apache Spark Documentation > API Docs > PySpark > pyspark.pandas.DataFrame.to_pandas.
3. Book: "Learning Spark, 2nd Edition" by Jules S. Damji et al. (O'Reilly Media): Chapter 12, which covers the Pandas API on Spark, explains the architectural differences. It highlights that operations like to_pandas() trigger a collection of all distributed data to the driver, which is a common source of out-of-memory errors when used incautiously on large DataFrames.
Reference: Chapter 12, Section: "Pandas API on Spark".