"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 28, 2014

Spark Overview

I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked deeper into Spark. The session by Jyotiska NK on Python + Spark: Lightning Fast Cluster Computing was a useful starter on Spark. (slides of talk)

  • In-memory cluster computing framework for large-scale data processing
  • Developed in Scala, with Java + Python APIs
  • Not meant to replace Hadoop; it can sit on top of Hadoop
  • See Spark Summit for slides / videos from past events - link 
Python Offerings
  • PySpark, data pipelines using Spark
  • Spark for real-time / batch processing
Spark vs MapReduce Differences
This section was the highlight of the session. The way data is handled in the MapReduce execution model versus the Spark approach is the key difference.

MapReduce Approach - Load data from disk into RAM; Mapper, Shuffler, and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data.
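The Mapper -> Shuffle -> Reducer stages above can be sketched as a toy word count in plain Python (illustrative only; nothing here is actually distributed, and the input documents are made up for the example):

```python
from collections import defaultdict

# Toy single-process word count following the Mapper -> Shuffle -> Reducer stages.
docs = ["spark is fast", "spark sits on hadoop", "hadoop is batch"]

# Map stage: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle stage: group emitted values by key (word).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce stage: sum the counts for each word.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["spark"])   # 2
print(counts["hadoop"])  # 2
```

In a real MapReduce job, each stage writes its output back to disk, which is exactly the overhead Spark's in-memory approach avoids.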

Spark - Load data into RAM and keep it there until you are done; data is cached in RAM from disk for iterative processing. If the data is too large, the rest is spilled to disk. This enables interactive processing of datasets without having to reload data into memory each time. The core abstraction is the RDD (Resilient Distributed Dataset).

RDD - Read-only collection of objects partitioned across machines. On losing a partition, it can still be recomputed from its lineage.

RDD Operations
  • Transformations - map, filter, sort, flatMap
  • Actions - reduce, count, collect, save data to local disk. Actions usually involve disk operations
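The transformation / action split above is what makes RDDs lazy: transformations only record work, and nothing executes until an action is called. A minimal pure-Python sketch of that idea (a hypothetical FakeRDD class for illustration, not the real PySpark implementation):

```python
from functools import reduce as _reduce

# Minimal sketch of Spark's lazy-transformation model (illustrative only).
class FakeRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []   # queued transformations, not yet executed

    # Transformations: record the operation and return a new "RDD".
    def map(self, f):
        return FakeRDD(self.data, self.ops + [lambda xs: [f(x) for x in xs]])

    def filter(self, p):
        return FakeRDD(self.data, self.ops + [lambda xs: [x for x in xs if p(x)]])

    # Actions: run every queued transformation, then produce a result.
    def collect(self):
        xs = self.data
        for op in self.ops:
            xs = op(xs)
        return xs

    def count(self):
        return len(self.collect())

    def reduce(self, f):
        return _reduce(f, self.collect())

rdd = FakeRDD(range(1, 11)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())                   # [4, 16, 36, 64, 100]
print(rdd.count())                     # 5
print(rdd.reduce(lambda a, b: a + b))  # 220
```

In real PySpark the same chain reads almost identically (`sc.parallelize(range(1, 11)).map(...).filter(...).collect()`), but the work is split across the cluster and intermediate data stays in RAM.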

More Reads
  • Testing Spark Best Practices
  • Gatling - Open Source Perf Test Framework
  • Spark Paper

Happy Learning!!!
