Notes captured from the session
Big Data Use Cases (Making personalized decisions for each customer, Analysing data trends)
Data Processing Goals
- Earlier Trend - Analyse historical data
- Current Trend - Real-time data processing
- Goal - Sophisticated data processing (Trend analysis, Anomaly detection)
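To make the "Anomaly detection" goal above concrete, here is a minimal sketch that flags values lying far from the mean (a simple z-score check). The function name find_anomalies, the threshold of 2.0 and the sample readings are assumptions made for illustration; the session did not prescribe any particular method.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Hypothetical helper for illustration; the threshold is an assumed default.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Made-up readings with one obvious outlier at index 7.
readings = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
anomalies = find_anomalies(readings)
print(anomalies)
```

A real pipeline would likely use a more robust statistic (for example a median-based one), since a single large outlier also inflates the mean and standard deviation.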
Open Analytics Stack
- Apps - Data Analysis, Mining, Decision Driven Apps
- Data Processing - HBase, Hive, Hadoop
- Storage - HDFS
- Infrastructure - Cluster
Goals of Open Analytics Stack
- Support batch, interactive and stream processing
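One way to picture supporting batch and stream processing together is to reuse the same processing function for a full dataset and for a sequence of small batches. The sketch below is plain Python, not the API of any framework in the stack; count_events, run_batch and run_stream are hypothetical names, and the event data is made up.

```python
from collections import Counter

def count_events(records):
    """Core logic shared by both modes: count occurrences of each event."""
    return Counter(records)

def run_batch(dataset):
    """Batch mode: process the whole dataset in one pass."""
    return count_events(dataset)

def run_stream(micro_batches):
    """Stream mode: apply the same logic to each small batch and
    yield the running totals after every batch."""
    totals = Counter()
    for batch in micro_batches:
        totals.update(count_events(batch))
        yield dict(totals)

events = ["click", "view", "click", "buy"]
batch_result = run_batch(events)

# The same events arriving as two micro-batches.
stream_results = list(run_stream([["click", "view"], ["click", "buy"]]))
print(dict(batch_result))
print(stream_results[-1])
```

Once the stream has consumed all events, its running totals match the batch result, which is the property a shared execution model relies on.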
Implementation Notes
- Store data in memory (SSDs, 512GB of RAM)
- FB / Yahoo / Bing - Some jobs are very large, but the vast majority are pretty small
- Aggregated inputs for other jobs fit in the memory of the cluster
- Parallelism of jobs, Failure Recovery and Job Scheduling are handled by the framework
- Trade-off between accuracy and response time
- Single execution framework for batch, streaming and interactive computations
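The trade-off between accuracy and response time noted above can be illustrated with sampling: reading only a fraction of the rows gives a faster but approximate answer. This is a minimal sketch under that assumption; approximate_mean, the 1% sample fraction and the synthetic data are all hypothetical and do not represent how any specific engine implements approximation.

```python
import random
from statistics import mean

def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean from a random sample of the data.

    Returns (estimate, rows_read). A smaller sample_fraction reads fewer
    rows (faster) at the cost of a less accurate estimate.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample_size = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)
    return mean(sample), sample_size

# Synthetic data standing in for a large table column.
data = list(range(1, 100_001))
exact = mean(data)

estimate, rows_read = approximate_mean(data, 0.01)
relative_error = abs(estimate - exact) / exact
print(rows_read, relative_error)
```

Raising sample_fraction towards 1.0 moves the estimate towards the exact answer while reading more rows, which is the knob the trade-off refers to.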
New layers added to the stack are shown in parentheses ()
- Application
- Data Processing (In Memory Processing)
- Storage (Data Management Layer), (Resource Management)
- Infrastructure
- One cluster for both MPI and Hadoop
- Spark (Batch & Interactive Apps Support)
- Spark and Shark are available in Amazon Elastic MapReduce
- Tachyon - Storage abstraction
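The in-memory processing idea behind this layer can be sketched as "load once, query many times": the data is read from slow storage a single time and later interactive queries reuse the copy held in memory. The class InMemoryDataset, the slow_loader function and the sample rows are hypothetical and written in plain Python; they are not the Spark or Tachyon API.

```python
class InMemoryDataset:
    """Toy dataset that loads its rows once and serves later reads from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.load_count = 0  # how many times the slow source was read

    def rows(self):
        if self._cache is None:
            # First access: read from the slow source and keep the result.
            self._cache = list(self._loader())
            self.load_count += 1
        return self._cache

def slow_loader():
    """Stand-in for reading from disk or a distributed file system."""
    for i in range(1, 6):
        yield {"user": f"u{i}", "spend": i * 10}

dataset = InMemoryDataset(slow_loader)

# Two different "interactive" queries over the same cached data.
total_spend = sum(row["spend"] for row in dataset.rows())
big_spenders = [row["user"] for row in dataset.rows() if row["spend"] >= 40]
print(total_spend, big_spenders, dataset.load_count)
```

Both queries run against the cached rows, so the slow source is read only once.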
Architecture and Component - Screenshots
Good Session, Happy Learning!!!