"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 07, 2020

Spark (immutability)

Immutability - "unchanging over time or unable to be changed"

Why immutability?
  • If data is read and written (updated) at the same time, concurrency is harder to achieve.
  • Immutability is the way to go for highly concurrent (multithreaded) systems, because data that never changes can be shared between threads without locking.
  • OLTP (online transaction processing), by contrast, is built around real-time, fine-grained updates.
  • OLTP systems support transactions with ACID properties over one or a batch of granular updates.
  • Immutable data can live in memory as easily as on disk.
  • DataFrames in Spark are built on top of RDDs, which are immutable by nature. An update or edit to a DataFrame generates a new DataFrame instead of modifying the existing one.
Ref - Link1, Link2
  • RDDs track lineage information so that lost data can be rebuilt.
  • This lineage tracking is what makes RDDs fault-tolerant.
  • The lineage records every transformation that has to be applied to produce an RDD, including the location from which the data has to be read.
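To make the "a new DataFrame instead of an update" point concrete, here is a minimal sketch in plain Python. It is only a toy model: `ToyFrame` and `with_column` are hypothetical names invented for this illustration and are not part of Spark's API. The idea it shows is the same, though: adding a column returns a new object and leaves the original untouched.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical immutable table; a stand-in for illustration, not a Spark class."""
    columns: Tuple[str, ...]
    rows: Tuple[Tuple[object, ...], ...]

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "ToyFrame":
        # Build new row tuples; the tuples held by `self` are never modified.
        new_rows = tuple(
            row + (fn(dict(zip(self.columns, row))),) for row in self.rows
        )
        return ToyFrame(self.columns + (name,), new_rows)


original = ToyFrame(columns=("id", "price"), rows=((1, 10), (2, 20)))
updated = original.with_column("double_price", lambda r: r["price"] * 2)

print(original.columns)  # still only the two original columns
print(updated.columns)   # the new frame carries the extra column
```

Because the class is a frozen dataclass holding tuples, an attempt such as `original.rows = ()` raises an error instead of silently changing shared data, which is the property that makes concurrent reads safe.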
What is lineage?

From Wikipedia
  • Data lineage includes the data origin, what happens to it and where it moves over time
  • It allows replaying specific portions or inputs of the data flow for step-wise debugging or for regenerating lost output.
  • Data lineage provides the audit trail of the data points at the highest granular level
  • In Spark, lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
Ref - Link
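The lineage idea above can be sketched with a small toy model in plain Python. Everything here is an assumption made for illustration: `ToyRDD`, `read_partition`, `materialized` and the other names are hypothetical and do not mirror Spark's real classes or internals. The sketch only shows the principle that each dataset remembers its parent and the function applied to it, so a lost partition can be recomputed from the source.

```python
class ToyRDD:
    """Hypothetical lineage model for illustration; not Spark's RDD class."""

    def __init__(self, num_partitions, source=None, parent=None, fn=None):
        self.num_partitions = num_partitions
        self.source = source      # root only: partition index -> iterable of records
        self.parent = parent      # the dataset this one was derived from
        self.fn = fn              # the transformation applied to the parent
        self.materialized = {}    # partition index -> records held "in memory"

    def map(self, fn):
        # A transformation never changes `self`; it returns a new ToyRDD
        # that remembers its parent and the function to apply.
        return ToyRDD(self.num_partitions, parent=self, fn=fn)

    def compute_partition(self, i):
        # Walk the lineage back to the source and replay each step.
        if self.parent is None:
            return list(self.source(i))
        return [self.fn(x) for x in self.parent.compute_partition(i)]

    def get_partition(self, i):
        # Serve from memory if present, otherwise rebuild from lineage.
        if i not in self.materialized:
            self.materialized[i] = self.compute_partition(i)
        return self.materialized[i]

    def lineage(self):
        steps, node = [], self
        while node is not None:
            steps.append("source" if node.parent is None else f"map({node.fn.__name__})")
            node = node.parent
        return list(reversed(steps))


def read_partition(i):
    # Stands in for "the location from which the data has to be read".
    return range(i * 3, i * 3 + 3)


def double(x):
    return x * 2


def add_one(x):
    return x + 1


result = ToyRDD(2, source=read_partition).map(double).map(add_one)

before = result.get_partition(1)
del result.materialized[1]        # simulate losing the in-memory partition
after = result.get_partition(1)   # rebuilt by replaying the lineage

print(result.lineage())
print(before == after)
```

Nothing is replicated in this model: the only thing kept besides the data is the recipe, which is why losing a partition costs recomputation rather than a second copy in memory.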
Internals of Spark - Good Read

Happy Learning!!!

