"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 22, 2012

Hadoop Quick Bytes

I have reviewed a couple of YouTube sessions on Hadoop basics. Listed below are short one-liners on the fundamentals of the Hadoop framework.

Hadoop - Open source framework targeted at batch / offline data processing and at data- and I/O-intensive applications

HDFS - Split, Scatter, Replicate and Manage Data across nodes
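
To make "split, scatter, replicate" concrete, here is a minimal sketch in plain Python. It is a toy model, not the HDFS API: the function names, the 4-byte block size, and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks, a default replication factor of 3, and rack-aware placement).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split: cut the file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Scatter and replicate: put each block on `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical 10-byte "file" and a 4-node cluster.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

print(blocks)     # [b'abcd', b'efgh', b'ij']
print(placement)  # {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'], 2: ['node3', 'node4', 'node1']}
```

Every block ends up on three different nodes, so losing any single node still leaves two copies of each block.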

Map Reduce - Divide tasks, co-locate them with parts of the data, and manage failure across nodes

Map Reduce is a paradigm shift
  • Operate on file splits
  • Each task operates on one block of the file
  • Operate on key, value pairs
  • Processing does not move the data
  • Instead, move the code to where the data is available
  • Data locality is the key to the Map Reduce programming approach
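
The key, value idea in the list above can be sketched with the classic word count. This is a single-process toy in plain Python, not Hadoop code: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and the two "splits" are made-up input standing in for file blocks on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map: turn each line of one split into (word, 1) pairs."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


# Two hypothetical splits; in a cluster each would be mapped on the node holding it.
splits = [
    ["hadoop stores data", "hadoop moves code"],
    ["code runs near data"],
]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts["hadoop"], counts["data"], counts["code"])  # 2 2 2
```

Each split is mapped independently of the others, which is what lets the map step run in parallel next to the data.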
HDFS Features
  • Fault Tolerant - When nodes fail, the replicated and distributed data is leveraged to recover lost data
  • Self-Healing - Rebalances on failure: when a task allocated to a node fails, the work is reallocated to another free node
  • Scalable - New nodes can be added to store data and to participate in executing Map Reduce jobs
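
As a rough illustration of the fault-tolerance point, the sketch below shows how surviving replicas can be used to restore the replication factor after a node is lost. It is a simplified model in plain Python, not HDFS internals: `handle_node_failure` and the node names are hypothetical, and picking the first available node is an assumption made to keep the example short.

```python
def handle_node_failure(placement, live_nodes, failed, replication):
    """Drop the failed node and copy under-replicated blocks to other live nodes."""
    survivors = [node for node in live_nodes if node != failed]
    repaired = {}
    for block_id, holders in placement.items():
        remaining = [node for node in holders if node != failed]
        if not remaining:
            raise RuntimeError(f"block {block_id} lost: no surviving replica")
        # Candidate targets are live nodes that do not already hold this block.
        candidates = [node for node in survivors if node not in remaining]
        while len(remaining) < replication and candidates:
            remaining.append(candidates.pop(0))
        repaired[block_id] = remaining
    return repaired


# Hypothetical cluster state before node2 fails.
placement = {
    0: ["node1", "node2", "node3"],
    1: ["node2", "node3", "node4"],
}
repaired = handle_node_failure(
    placement,
    live_nodes=["node1", "node2", "node3", "node4"],
    failed="node2",
    replication=3,
)

print(repaired)  # {0: ['node1', 'node3', 'node4'], 1: ['node3', 'node4', 'node1']}
```

No data is lost because each block still had two healthy copies to recover from.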

Key Strategy Shift - A Map Reduce job is executed where the data is stored. This is in sharp contrast to the traditional ETL process, where data is pulled (delta pull) from production systems, cleansed, and then loaded into a target system to refresh data marts.
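
"Run the job where the data is stored" can be sketched as a tiny scheduler that prefers a node already holding the block and falls back to a remote node only when no local slot is free. This is a toy in plain Python, not the Hadoop scheduler: `schedule_tasks`, the slot counts, and the tie-breaking rule are all assumptions for illustration.

```python
def schedule_tasks(placement, free_slots):
    """Assign one map task per block, preferring nodes that store the block."""
    slots = dict(free_slots)  # copy so the caller's dict is left unchanged
    assignment = {}
    for block_id, holders in placement.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            # Data-local: the code moves to a node that already has the block.
            node, locality = max(local, key=lambda n: slots[n]), "local"
        else:
            # Fallback: the block must be read over the network.
            remote = [n for n in slots if slots[n] > 0]
            if not remote:
                raise RuntimeError("no free slots in the cluster")
            node, locality = max(remote, key=lambda n: slots[n]), "remote"
        slots[node] -= 1
        assignment[block_id] = (node, locality)
    return assignment


# Hypothetical placement and one free task slot per node.
placement = {0: ["node1", "node2"], 1: ["node1", "node2"], 2: ["node1"]}
assignment = schedule_tasks(placement, {"node1": 1, "node2": 1, "node3": 1})

print(assignment)  # {0: ('node1', 'local'), 1: ('node2', 'local'), 2: ('node3', 'remote')}
```

Two of the three tasks run without moving any data; only the third has to fetch its block from another node, which is the cost the data-locality approach tries to minimize.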

Happy Learning!!!!
