Hadoop - Open-source framework targeted at batch / offline data processing and data- and I/O-intensive applications
HDFS - Split, scatter, replicate and manage data across nodes (see the block-location sketch after these definitions)
MapReduce - Divide tasks, co-locate them with parts of the data, and manage failures across nodes
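As a rough illustration of how HDFS splits, scatters and replicates a file, the sketch below uses the Hadoop FileSystem API to print each block's offset, length and the hosts holding its replicas. This is a minimal sketch, not production code; the path /data/sample.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file already stored in HDFS
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation is one block of the file, replicated on several hosts
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}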
MapReduce is a paradigm shift
- Operate on file splits
- Operate on one block of a file at a time
- Operate on key/value pairs
- Processing does not move the data
- Code is moved to where the data is available
- Data locality is the key in the MapReduce programming approach (a minimal word-count sketch follows this list)
- Fault tolerant - when nodes fail, the replicated and distributed data is leveraged to recover lost data
- Self-healing - when a task allocated to a node fails, the task is reallocated to another free node
- Scalable - new nodes can be added to store data and participate in executing MapReduce jobs
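To make the key/value and per-split points concrete, here is a minimal word-count sketch using the standard Hadoop mapreduce API. The class names (WordCountMapper, WordCountReducer) are illustrative, not from the original post; each map task processes one input split (typically one HDFS block) and emits (word, 1) pairs, and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper instance works on one input split (usually one HDFS block),
// ideally on the node that already holds that block (data locality).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) key/value pairs
        }
    }
}

// The framework groups pairs by key; the reducer sums the counts per word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}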
Key strategy shift - a MapReduce job is executed where the data is stored (see the driver sketch below). This is in sharp contrast to the traditional ETL process, where data is pulled from production systems (delta pull), cleansed, and loaded into a target system to refresh data marts.
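A minimal job driver tying the mapper and reducer above together might look like the following; the input and output paths are placeholders, and it is the framework's scheduler that tries to run each map task on a node already holding the corresponding input block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths; map tasks are scheduled on nodes holding
        // the input blocks rather than shipping the data to the code.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}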
Happy Learning!!!!