Paper Link
- Resilient Distributed Datasets (RDDs) - a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner
- Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression.
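The value of in-memory reuse can be sketched in plain Python (this is not Spark's API; `load_dataset` is a hypothetical stand-in for reading and parsing input from stable storage). Iterative algorithms touch the same input every iteration, so materializing it once and reusing it avoids repeated loading:

```python
# Minimal sketch (plain Python, not Spark): iterative algorithms reuse
# the same dataset every iteration, so keeping it in memory avoids
# re-loading and re-parsing it each time.

load_count = 0

def load_dataset():
    """Hypothetical stand-in for reading/parsing a file from stable storage."""
    global load_count
    load_count += 1
    return [float(x) for x in range(1000)]

# Without persistence: the dataset is re-materialized every iteration.
for _ in range(5):
    data = load_dataset()
    total = sum(data)
assert load_count == 5

# With persistence (the role of rdd.persist() in Spark): load once, reuse.
load_count = 0
cached = load_dataset()
for _ in range(5):
    total = sum(cached)
assert load_count == 1
```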
- RDDs are a good fit for many parallel applications because these applications naturally apply the same operation to multiple data items.
- RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators
- If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition
- Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
- RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage
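Lineage-based recovery can be sketched with two hypothetical classes (not Spark's actual implementation): each derived RDD records its parent and the transformation that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
# Sketch of lineage-based fault recovery (hypothetical classes, not
# Spark internals). A derived RDD remembers its parent and function,
# which is exactly the information needed to recompute a lost partition.

class SourceRDD:
    """Leaf RDD backed by stable storage (here, in-memory lists)."""
    def __init__(self, partitions):
        self._partitions = partitions

    def compute(self, i):
        return list(self._partitions[i])

    def map(self, f):
        return MappedRDD(self, f)

class MappedRDD:
    """Derived RDD: its parent plus its function form its lineage."""
    def __init__(self, parent, f):
        self.parent, self.f = parent, f
        self._cache = {}            # partitions persisted in memory

    def compute(self, i):
        if i not in self._cache:    # cache miss: recompute via lineage
            self._cache[i] = [self.f(x) for x in self.parent.compute(i)]
        return self._cache[i]

    def map(self, f):
        return MappedRDD(self, f)

base = SourceRDD([[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)

assert squares.compute(0) == [1, 4]
squares._cache.pop(0)               # simulate losing a cached partition
assert squares.compute(0) == [1, 4] # recomputed from lineage, no replica needed
```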
- The main difference between RDDs and DSM is that RDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads and writes to each memory location
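Why coarse-grained updates make fault tolerance cheap can be illustrated with a toy contrast (hypothetical, not Spark code): with DSM-style fine-grained mutation, recovery would require logging every individual write, while with RDD-style transformations the "log" is one recorded function per derived dataset, regardless of its size.

```python
# Toy contrast (hypothetical, not Spark code): fine-grained writes vs.
# a single coarse-grained transformation over a whole dataset.

data = list(range(4))

# Fine-grained (DSM-style): every write to a location is independent
# state, so faithfully logging the update requires one entry per write.
fine_log = []
for i in range(len(data)):
    data[i] = data[i] + 10
    fine_log.append((i, data[i]))
assert len(fine_log) == 4           # grows with the data size

# Coarse-grained (RDD-style): one recorded transformation describes the
# entire derived dataset, so the lineage log stays constant-sized.
coarse_log = [("map", lambda x: x + 10)]
derived = [coarse_log[0][1](x) for x in range(4)]
assert derived == data
assert len(coarse_log) == 1
```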