Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Spark (immutability)

February 07, 2020

Spark (immutability)

Immutability: "unchanging over time or unable to be changed"

Why immutability?

If you read and write (update) the same data at the same time, concurrency is harder to achieve.
Immutability is the way to go for highly concurrent (multithreaded) systems.
OLTP systems, by contrast, perform real-time, fine-grained updates.
OLTP supports transactions / ACID properties on one or a batch of granular updates.
Immutable data can live as easily in memory as on disk.
DataFrames in Spark are built on top of RDDs, which are immutable. Updating or editing a DataFrame produces a new DataFrame instead of modifying the existing one.
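
The "new data frame instead of updating the existing one" behaviour can be sketched in plain Python. This is only an illustration of the idea, not Spark code: `Row`, `Frame`, and `with_price_scaled` are made-up names, and a frozen dataclass stands in for an immutable dataset.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Row:
    name: str
    price: float


@dataclass(frozen=True)
class Frame:
    """Hypothetical stand-in for an immutable dataset; not a Spark class."""
    rows: Tuple[Row, ...]

    def with_price_scaled(self, factor: float) -> "Frame":
        # Build a brand-new Frame; the rows held by `self` are never modified.
        return Frame(tuple(replace(r, price=r.price * factor) for r in self.rows))


original = Frame((Row("pen", 2.0), Row("book", 10.0)))
updated = original.with_price_scaled(1.5)

print(original.rows[0].price)  # still 2.0
print(updated.rows[0].price)   # 3.0
```

Because `original` can never change, any number of threads can read it while `updated` is being built, which is the concurrency benefit described above.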

Ref - Link1, Link2

RDDs track lineage information to rebuild lost data.
RDDs are fault-tolerant because they track data lineage information.
A lineage keeps track of all the transformations that have to be applied to an RDD, including the location from which the data has to be read.
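
As a rough sketch of what "keeping track of transformations and the read location" could look like, here is a toy model in plain Python. `ToyRDD` and the input path are invented for illustration and are not the Spark API; each object only remembers its parent and the step that produced it.

```python
from typing import List, Optional


class ToyRDD:
    """Hypothetical, simplified model of an RDD; not the Spark API."""

    def __init__(self, source: Optional[str] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[str] = None):
        self.source = source  # where the data is read from (root only)
        self.parent = parent  # the dataset this one was derived from
        self.op = op          # description of the transformation applied

    def map(self, label: str) -> "ToyRDD":
        # A transformation returns a new object that points back to its parent.
        return ToyRDD(parent=self, op=f"map({label})")

    def filter(self, label: str) -> "ToyRDD":
        return ToyRDD(parent=self, op=f"filter({label})")

    def lineage(self) -> List[str]:
        # Walk back to the root, then report the steps in execution order.
        steps = []
        node = self
        while node is not None:
            if node.parent is None:
                steps.append(f"read({node.source})")
            else:
                steps.append(node.op)
            node = node.parent
        return list(reversed(steps))


base = ToyRDD(source="hdfs://example/input.txt")  # made-up path
result = base.map("parse").filter("price > 0")
print(result.lineage())
# ['read(hdfs://example/input.txt)', 'map(parse)', 'filter(price > 0)']
```

In Spark itself, the lineage of a real RDD can be inspected with its `toDebugString` method.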

What is lineage?

From Wikipedia

Data lineage includes the data origin, what happens to it, and where it moves over time.
Lineage allows replaying specific portions or inputs of the data flow for step-wise debugging or for regenerating lost output.
Data lineage provides the audit trail of the data points at the highest granular level.
In Spark, lineage is the mechanism an RDD uses to reconstruct lost partitions. Spark does not replicate data in memory by default; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
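
The recovery idea can be sketched with a small toy in plain Python. This is not how Spark is implemented: `LineageDataset`, `compute_partition`, and `lose_partition` are hypothetical names, the in-memory list stands in for durable source storage, and a dictionary stands in for partitions cached on executors.

```python
from typing import Callable, Dict, List, Optional


class LineageDataset:
    """Hypothetical toy showing lineage-based recovery; not the Spark API."""

    def __init__(self, source_partitions: List[List[int]],
                 transforms: Optional[List[Callable[[int], int]]] = None):
        self._source = source_partitions      # stands in for durable storage
        self._transforms = transforms or []   # the lineage: ordered steps
        self._cache: Dict[int, List[int]] = {}  # computed partitions in memory

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        # Return a new dataset whose lineage has one extra step.
        return LineageDataset(self._source, self._transforms + [fn])

    def compute_partition(self, idx: int) -> List[int]:
        # If the partition is missing, replay the lineage from the source.
        if idx not in self._cache:
            data = self._source[idx]
            for fn in self._transforms:
                data = [fn(x) for x in data]
            self._cache[idx] = data
        return self._cache[idx]

    def lose_partition(self, idx: int) -> None:
        # Simulate losing the in-memory copy of one partition.
        self._cache.pop(idx, None)


ds = LineageDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
first = ds.compute_partition(1)
ds.lose_partition(1)
rebuilt = ds.compute_partition(1)  # recomputed from source, no replica needed
print(first, rebuilt)  # [31, 41] [31, 41]
```

Only the lost partition is recomputed; partition 0 is never touched during the recovery.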

Ref - Link
Internals of Spark - Good Read

Happy Learning!!!

About Me and Disclaimer

Welcome Visitor,
I have 20 years of experience (Coder - Empirical Learner - Teacher). I am currently working in the Data Analytics (Video-Image-Text-Data) / Database / BI space. I dabble with "Data". Ping me or send a request to connect if what I do appeals to you and you want to talk about it (Data Science / Databases / Deep Learning / Architecture / Design Discussions / Consulting Projects / Machine Learning Trainings / Strategic Leadership Roles).
Personal Goal - Reach / Teach up to 10 Million Students through various mediums (Catalyst between Academics and Industry)
My request to readers: I hope you find the posts, code snippets, and notes helpful. Please share your learning with others. We can grow only by learning and teaching.

6+ years in AI, working on image, video, text, and numeric data

15+ years in Databases

10+ years in developing, deploying, and monitoring large-scale solutions in Supply Chain and Retail

It's my personal blog. The objective of this blog is to bookmark/share my learnings. Posts reflect my opinions, perspectives, and interests. Blog posts presented are my personal views and do not represent my employer's view. I have acknowledged all posts with References/Bookmarks.

For questions/feedback/career opportunities/training / consulting assignments/mentoring - please drop a note to sivaram2k10(at)gmail(dot)com
Coach / Code / Innovate

A blogpost a day keeps your thinking going.
