"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 28, 2020

Weekend Reading - Resilient Distributed Datasets

 Paper Link 

Key notes
  • Resilient Distributed Datasets (RDDs) - a distributed memory abstraction that lets programmers perform in-memory computations on large cluster
  • Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression. 
  • RDDs are a good fit for many parallel applications because these applications naturally apply the same operation to multiple data items. 
  • RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement
  • If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute
  • Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
  • RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage
  • The main difference between RDDs and DSM is that RDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads and writes to each memory location




Keep Thinking!!!

No comments: