Paper Link
Key notes
- Resilient Distributed Datasets (RDDs) - a distributed memory abstraction that lets programmers perform in-memory computations on large clusters
- Data reuse is common in many iterative machine learning and graph algorithms, including PageRank, K-means clustering, and logistic regression.
- RDDs are a good fit for many parallel applications because these applications naturally apply the same operation to multiple data items.
- RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators (see the sketch after these notes)
- If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition
- Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
- An RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage (a lineage sketch follows these notes)
- The main difference between RDDs and distributed shared memory (DSM) is that RDDs can only be created (“written”) through coarse-grained transformations, while DSM allows reads and writes to each memory location
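A minimal Scala sketch of these ideas, assuming a local SparkContext and a hypothetical hdfs:// log path: an RDD is created from data in stable storage, derived RDDs are built only through coarse-grained transformations, and an intermediate result is explicitly persisted in memory with its partitioning controlled by a HashPartitioner.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddNotesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-notes").setMaster("local[*]"))

    // (1) An RDD created from data in stable storage (the path is hypothetical).
    val lines = sc.textFile("hdfs://namenode/logs/app.log")

    // (2) RDDs derived from other RDDs only through coarse-grained, deterministic
    //     transformations; nothing is mutated in place (unlike DSM).
    val errors = lines.filter(line => line.contains("ERROR"))
    val byHost = errors.map(line => (line.split("\t")(0), 1))

    // Explicitly persist an intermediate result in memory and control its
    // partitioning so later stages reuse the same data placement.
    val counts = byHost
      .partitionBy(new HashPartitioner(8))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)

    // Data reuse: two actions run over the cached RDD without recomputing it
    // from stable storage.
    println(counts.count())
    counts.take(5).foreach(println)

    sc.stop()
  }
}
```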
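The lineage that makes this recovery possible can be inspected directly; continuing from the sketch above, toDebugString prints the chain of transformations Spark has recorded for counts:

```scala
// The recorded lineage of `counts`: enough information to recompute any lost
// partition from `lines` (and ultimately from stable storage) rather than
// relying on replication.
println(counts.toDebugString)
```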
Keep Thinking!!!