"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 05, 2014

Weekend Learning - Good Session - Taming Big Data with Berkeley Data Analytics Stack

Notes captured from the session

Big Data Use Cases (making personalized decisions for each customer, analysing data trends)

Data Processing Goals
  • Earlier Trend - Analyse historical data
  • Current Trend - Real time data processing
  • Goal - Sophisticated data processing (Trend analysis, Anomaly detection)
Open Analytics Stack
  • Apps - Data Analysis, Mining, Decision Driven Apps
  • Data Processing - HBase, Hive, Hadoop
  • Storage - HDFS
  • Infrastructure - Cluster
Goals of Open Analytics Stack
  • Support batch, interactive and stream processing
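To make the "one stack for batch and stream processing" goal concrete, here is a minimal sketch in plain Python (not Spark or any BDAS API). It assumes the stream is handled as a series of small batches, so the same counting logic is written once and reused in both modes. The names `count_events`, `run_batch` and `run_streaming` are hypothetical, made up for illustration.

```python
from collections import Counter
from itertools import islice


def count_events(events):
    """Core logic, written once: count events per type."""
    return Counter(e["type"] for e in events)


def run_batch(events):
    """Batch mode: process the whole historical dataset in one pass."""
    return count_events(events)


def run_streaming(event_iter, batch_size):
    """Streaming mode: consume the iterator in small batches and
    yield a copy of the running totals after each batch."""
    running = Counter()
    while True:
        chunk = list(islice(event_iter, batch_size))
        if not chunk:
            break
        running += count_events(chunk)  # same logic as batch mode
        yield Counter(running)


# Illustrative data only, not from the session.
events = [{"type": t} for t in ["click", "view", "click", "buy", "view", "click"]]

batch_result = run_batch(events)
stream_results = list(run_streaming(iter(events), batch_size=2))

print(batch_result)
print(stream_results[-1])
```

Once the stream is exhausted, the last running total matches the batch result, which is the point of sharing one piece of logic between the two modes.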
Implementation Notes
  • Store data in memory (SSDs, 512 GB of RAM)
  • FB / Yahoo / Bing - Some jobs are very large, but the vast majority are fairly small
  • Inputs of those smaller jobs fit in the aggregate memory of the cluster
  • Job parallelism, failure recovery and job scheduling are handled by the stack
  • Trade-off between accuracy and response time
  • Single execution framework for batch, streaming and interactive computations
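The trade-off between accuracy and response time can be sketched with simple random sampling: answering a query from a small sample is faster, but the estimate carries a larger error. This is a plain Python sketch under my own assumptions, not the mechanism any BDAS component actually uses. The function `approximate_mean` and the synthetic latency data are hypothetical.

```python
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, standard_error, sample_size). A smaller
    sample_fraction means less work but a larger standard error.
    """
    rng = random.Random(seed)
    sample_size = min(len(values), max(2, int(len(values) * sample_fraction)))
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, std_error, sample_size


# Synthetic data for illustration only.
data_rng = random.Random(42)
latencies_ms = [data_rng.gauss(200.0, 50.0) for _ in range(100_000)]
exact = statistics.fmean(latencies_ms)

for fraction in (0.001, 0.01, 0.1):
    est, err, n = approximate_mean(latencies_ms, fraction)
    print(f"fraction={fraction} n={n} estimate={est:.2f} +/- {err:.2f} (exact {exact:.2f})")
```

Each tenfold increase in sample size shrinks the standard error by roughly a factor of three (the square root of ten), so the caller can choose how much accuracy to pay for.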
Updated stack (newly added layers are shown in parentheses)
  • Application
  • Data Processing (In Memory Processing)
  • Storage (Data Management Layer), (Resource Management)
  • Infrastructure
  • One cluster for both MPI and Hadoop
  • Spark (Batch & Interactive Apps Support)
  • Spark and Shark are available in Amazon Elastic MapReduce
  • Tachyon - Storage abstraction
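The idea behind the in-memory processing layer can be sketched as follows: load a dataset once, keep it in memory, and let several queries reuse it instead of re-reading storage each time. This is plain Python written for illustration, not the Spark or Tachyon API. `CachedDataset` and `slow_loader` are hypothetical names.

```python
class CachedDataset:
    """Hypothetical stand-in for a dataset that stays in memory
    after its first load."""

    def __init__(self, loader):
        self._loader = loader
        self._records = None
        self.load_count = 0  # how many times storage was actually read

    def records(self):
        if self._records is None:
            self._records = list(self._loader())
            self.load_count += 1
        return self._records


def slow_loader():
    """Simulates an expensive read from disk-backed storage."""
    for user_id in range(1000):
        yield {"user": user_id, "clicks": user_id % 7}


dataset = CachedDataset(slow_loader)

# Two different queries reuse the same in-memory records.
total_clicks = sum(r["clicks"] for r in dataset.records())
heavy_users = [r["user"] for r in dataset.records() if r["clicks"] == 6]

print(total_clicks, len(heavy_users), dataset.load_count)
```

The second query does not trigger another load, which is why interactive queries over the same data benefit most from keeping it in memory.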
Architecture and Components - Screenshots

Download the components from link
AMP Lab Blog link

Good Session, Happy Learning!!!