1. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association. Section 3.1, "RDD Abstraction," describes how datasets are partitioned across machines in a cluster. Increasing the number of machines (nodes) distributes these partitions more widely, thereby increasing the total memory capacity available to the dataset.
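The scaling argument in this annotation can be illustrated with a toy simulation (a sketch only, not Spark's actual partitioner): under hash partitioning, each node holds roughly total_records / num_nodes of the data, so adding nodes shrinks the largest per-node share and raises the cluster's aggregate in-memory capacity.

```python
def partition_sizes(num_records: int, num_nodes: int) -> list[int]:
    """Count how many records land on each node under simple hash partitioning."""
    counts = [0] * num_nodes
    for record_key in range(num_records):
        # Assign each record to a node by hashing its key, as a distributed
        # dataset would assign partitions to cluster machines.
        counts[hash(record_key) % num_nodes] += 1
    return counts

records = 1_000_000
for nodes in (4, 8, 16):
    largest = max(partition_sizes(records, nodes))
    print(f"{nodes} nodes -> largest partition holds {largest} records")
```

Doubling the node count halves the largest partition, which is the memory-scaling behavior the cited Section 3.1 attributes to the RDD abstraction.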
2. MIT OpenCourseWare. (2020). 6.824 Distributed Systems, Spring 2020. Lecture 3: GFS. The lecture discusses how the Google File System (and, by extension, other distributed systems such as MapReduce) achieves scalability for large datasets by distributing data and computation across a large number of commodity machines (nodes). Adding nodes is the primary scaling mechanism.
3. Armbrust, M., et al. (2015). Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15), Association for Computing Machinery, New York, NY, USA, 1383–1394. Section 2, "Programming Model," explains how Spark's distributed execution allows it to scale "by adding more machines to a cluster," which directly addresses resource constraints such as memory. DOI: https://doi.org/10.1145/2723372.2742797