"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data that challenges current limits in terms of volumes and speed of data generation

Big Data refers to the large data sets generated by social media, e-commerce sites, mobile devices, sensors, etc. Both the volume of data and the rate at which it grows are huge.
Big Data technology refers to technology that processes large volumes of data in an economically viable way, using commodity hardware.

The Hadoop ecosystem is at the core of Big Data technology. Hadoop technologies are inspired by Google's infrastructure:
  • Processing – MapReduce, inspired by Google MapReduce
  • File Storage – HDFS, inspired by the Google File System
  • Database – HBase, inspired by BigTable
  • Pig, Hive – for analytics

  • Storage Layer - HDFS
  • Processing Layer - MapReduce
HDFS (Hadoop Distributed File System) 

  • File system written in Java
  • High throughput; effective for reading large chunks of data
  • Runs on commodity hardware
  • Fault tolerant (data is replicated; if a copy becomes inaccessible, it is served from another replica, making it a highly available system)
  • Scalable (hardware can be added or removed, for file system storage and for MapReduce processing)
  • Runs on diverse environments
HDFS - Split, Scatter, Replicate and manage data across servers
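The split-scatter-replicate idea above can be sketched in plain Python. This is a toy illustration, not Hadoop code: the tiny block size, node names, and round-robin placement are made-up assumptions (real HDFS uses 128 MB blocks and rack-aware placement).

```python
# Toy sketch of HDFS-style split-and-replicate (NOT Hadoop code):
# a file is cut into fixed-size blocks and each block is placed on
# several distinct DataNodes. Sizes and node names are made up.
import itertools

BLOCK_SIZE = 4          # bytes per block (real HDFS default is 128 MB)
REPLICATION = 3         # copies kept of each block
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the file content into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    ring = itertools.cycle(nodes)
    return {idx: [next(ring) for _ in range(replication)]
            for idx in range(len(blocks))}

blocks = split_into_blocks(b"hello big data!")   # 15 bytes -> 4 blocks
placement = place_blocks(blocks)
print(len(blocks), placement[0])
```

Because every block lives on three nodes, losing one DataNode leaves two readable copies of each of its blocks, which is the fault tolerance noted above.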
HDFS Internals 
  • Based on a Master-Slave architecture
  • One NameNode and multiple DataNodes
  • The NameNode manages the file system metadata - the namespace and the allocation of blocks to DataNodes
  • DataNodes perform read/write operations upon instruction from the NameNode
  • Each DataNode reports its status to the NameNode via heartbeats
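The heartbeat mechanism can be illustrated with a few lines of Python. This is a toy model, not Hadoop's implementation; the class name, node ids, and 10-second timeout are assumptions for the sketch.

```python
# Toy model (NOT Hadoop's implementation) of heartbeat tracking:
# the NameNode records when each DataNode last reported in and
# treats nodes silent for longer than a timeout as dead.
class NameNode:
    def __init__(self, timeout_secs: float = 10.0):
        self.timeout = timeout_secs
        self.last_heartbeat = {}          # datanode id -> last report time

    def heartbeat(self, datanode_id: str, now: float):
        """Called when a DataNode reports that it is alive."""
        self.last_heartbeat[datanode_id] = now

    def live_nodes(self, now: float):
        """DataNodes whose last heartbeat is within the timeout."""
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t <= self.timeout]

nn = NameNode(timeout_secs=10.0)
nn.heartbeat("dn1", now=0.0)
nn.heartbeat("dn2", now=0.0)
nn.heartbeat("dn1", now=8.0)    # dn1 keeps reporting, dn2 goes silent
print(nn.live_nodes(now=12.0))  # dn2 has timed out by t=12
```

When a node drops out of the live list, the NameNode can schedule re-replication of that node's blocks from the surviving replicas.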
Map Reduce 
  • Parallel data processing framework
  • Designed to execute jobs in parallel
  • Two phases: Map and Reduce
  • The framework splits a job into map tasks and executes them in parallel
  • The Reducer consolidates the results obtained from the individual map tasks
  • The Map phase must complete before the Reduce phase can start
  • Computation is performed where the data is located (data locality)
  • Moving code to where the data resides is a cheaper, more efficient approach for large data volumes
Map Reduce - Distributed, Parallel data processing framework
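The map / shuffle / reduce flow described above can be sketched as a single-process word count. This is plain Python rather than the Hadoop Java API; the shuffle step stands in for the grouping-by-key the framework does between the two phases.

```python
# Minimal single-process sketch of the MapReduce flow (NOT the Hadoop API),
# using the classic word-count example: map emits (word, 1) pairs,
# shuffle groups values by key, reduce sums them per key.
from collections import defaultdict

def mapper(line: str):
    """Map phase: emit a (word, 1) pair for each word in the line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce phase: consolidate per-key results - here, sum the counts."""
    return (key, sum(values))

lines = ["big data big ideas", "data drives decisions"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["big"], counts["data"])
```

Note that `reducer` cannot run until `shuffle` has seen every mapper output, which mirrors the point above that the Map phase must finish before Reduce starts.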


HBase
  • NoSQL database hosted on top of HDFS
  • Columnar database
  • Targeted at random reads and real-time query processing
  • HBase uses HDFS as its storage layer, which takes care of the fault tolerance and scalability aspects
Hive and Pig
  • Targeted at analytics
  • Hive is the natural choice for SQL developers
  • Pig provides a scripting language for analytics queries
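HBase's data model - values addressed by row key, column family, and column - is what makes fast random reads possible. The sketch below is a toy in-memory model, not the HBase API; the table, family, and column names are made up for illustration.

```python
# Toy in-memory model of HBase-style storage (NOT the HBase API):
# values are addressed by (row key, column family, column), so a
# read is a direct lookup by row key rather than a full scan.
from collections import defaultdict

class ColumnarTable:
    def __init__(self):
        # row key -> column family -> column -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        """Random read: direct lookup by row key, no scan over other rows."""
        return self.rows[row_key][family].get(column)

users = ColumnarTable()
users.put("user42", "info", "name", "Asha")
users.put("user42", "stats", "logins", 17)
print(users.get("user42", "info", "name"))
```

Grouping columns into families also means a query touching only one family never reads the others, which suits the random-read, real-time access pattern noted above.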
In the next set of posts we will look at HBase, Hive and Pig in detail.
Happy Learning!!!
