"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Hadoop Basics. Show all posts

October 08, 2014

Hive Overview Notes

  • Data warehousing package built on top of Hadoop
  • Manages and queries structured data
  • Uses an embedded Apache Derby database for the metastore by default
  • The metastore_db folder persists the metastore data (table metadata)
  • Suitable for WORM - Write Once Read Many Times Access Pattern
  • Core Components are Shell, Metastore, Execution Engine, Compiler (Parse, Plan, Optimize), Driver
  • Tables can be created as internal tables or external tables (pointing to an external file location)
  • When an internal table is dropped, both schema and data are dropped. For an external table, only the schema is dropped, not the data. Data for both internal and external tables resides in HDFS
  • Data files for created tables are available under /user/hive/warehouse
  • Bucketing in Hive - hash value of the bucketing column % number of buckets decides which bucket a particular row goes into
  • Partitioned tables can be internal or external Hive tables
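
The "hash value % number of buckets" rule from the list above can be tried out with a small sketch. This is only an illustration in Python, not Hive code: the rows, the `bucket_for` helper and the choice of 4 buckets are made up, and it assumes that the hash of an integer column is the value itself and that the hash is masked to a non-negative number before the modulo.

```python
def bucket_for(key_hash, num_buckets):
    """Return the bucket index for a row, given the hash of its bucketing column."""
    # Mask to a non-negative 31-bit value first (assumed, Java-style), then take the modulo.
    return (key_hash & 0x7FFFFFFF) % num_buckets


# Hypothetical rows: (user_id, name). user_id plays the role of the bucketing column.
rows = [(101, "asha"), (102, "ben"), (103, "chen"), (104, "dia"), (105, "eli")]
NUM_BUCKETS = 4

buckets = {b: [] for b in range(NUM_BUCKETS)}
for user_id, name in rows:
    # Assumption: for an integer column the hash is the value itself.
    buckets[bucket_for(user_id, NUM_BUCKETS)].append(name)

for b, names in buckets.items():
    print(b, names)
```

Rows with the same column value always land in the same bucket, which is what makes bucketed data useful for sampling and joins.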
Happy Learning!!!

October 07, 2014

Map Reduce Internals

The client submits the job. The JobTracker creates tasks for the input splits and schedules them.

Mapper
  • Mapper runs the business logic (e.g. word counting)
  • Mapper extracts (maps) what you need from each record
  • Record reader provides input to the mapper in key-value format
  • Map-side join (using the distributed cache)
  • Output of the mapper is a list of keys and values. Output of the mapper function is stored in a sequence file
  • Framework splits the input into records based on the input format. The default is text format, with one record per line
  • Every row / record goes through the map function
  • When a row is split between two 64MB blocks, the two parts are read together as one complete record and processed
  • Default block size in Hadoop 2.0 is 128MB
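
How a row that straddles two blocks still gets processed exactly once can be sketched as below. This is a simplified Python model of a line-oriented record reader, not Hadoop's actual implementation; the function name `read_split` and the sample data are made up. The assumed rule is: every split except the first skips its first (possibly partial) line, and every split keeps reading until it finishes the line that crosses its end.

```python
def read_split(data, start, length):
    """Return the complete lines owned by the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: the line in progress here is read by the previous split.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A line that starts at or before `end` belongs to this split, even if it runs past `end`.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
SPLIT_SIZE = 10  # tiny on purpose, so that lines straddle split boundaries

for start in range(0, len(data), SPLIT_SIZE):
    print(start, read_split(data, start, SPLIT_SIZE))
```

With these inputs the word "charlie" crosses the boundary at byte 20, yet it is returned whole by the second split and skipped by the third.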
Reducer
  • Reducers poll the mapper nodes for their output; the JobTracker informs them which nodes to poll
  • Default number of reducers is 1. This is configurable
  • Multiple reduce phases within one job are not possible; multi-level (chained) MR jobs are possible
  • Reduce Side join (Join @ Reducer Level)
Combiner
  • Combiner - a mini reducer that runs on the mapper output before it is written to disk (e.g. finds the max value in that mapper's data)
  • Combiner is used when the map side itself can do some preprocessing to minimize the reducer workload
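
The "mini reducer" idea can be illustrated with the max-value case mentioned above. This is a plain Python sketch with made-up sales data and a hypothetical `combine_max` helper; it only works this simply because max gives the same answer whether it is applied once or in stages.

```python
def combine_max(pairs):
    """Collapse (key, value) pairs to one (key, max value) pair per key."""
    best = {}
    for key, value in pairs:
        if key not in best or value > best[key]:
            best[key] = value
    return sorted(best.items())


# Hypothetical output of two mappers: (year, sale amount).
mapper_1 = [(2012, 40), (2012, 75), (2013, 20)]
mapper_2 = [(2012, 60), (2013, 90), (2013, 35)]

# The combiner runs locally on each mapper's output...
combined = combine_max(mapper_1) + combine_max(mapper_2)
# ...so the reducer receives 4 pairs instead of 6 and applies the same function again.
result = combine_max(combined)

print(combined)
print(result)
```

The final result is identical with or without the combiner; only the amount of data sent to the reducer changes.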
Partitioner
  • Hash Partitioner is default partitioner
  • Mapper -> Combiner -> Partitioner -> Reducer (for multi-dimension results, e.g. 2012 - max sales by product, 2013 - max sales by location)
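
A rough Python sketch of what a hash partitioner does: it picks the reducer from the hash of the key, so every value for a given key reaches the same reducer. The `partition` function and the sample pairs are made up for illustration, and CRC32 is only a stand-in for the key's hash code (chosen because it is stable across Python runs).

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index from the key's hash."""
    # Stand-in hash: CRC32 of the key text, masked to a non-negative value.
    key_hash = zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF
    return key_hash % num_reducers


NUM_REDUCERS = 3
# Hypothetical mapper / combiner output: (year, max sale seen so far).
map_output = [("2012", 75), ("2013", 20), ("2012", 60), ("2013", 90)]

reducer_input = {r: [] for r in range(NUM_REDUCERS)}
for key, value in map_output:
    reducer_input[partition(key, NUM_REDUCERS)].append((key, value))

for reducer, pairs in reducer_input.items():
    print(reducer, pairs)
```

Which reducer a key lands on depends on the hash, but both "2012" pairs always end up together, and so do both "2013" pairs.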
Happy Learning!!!

October 06, 2014

Hadoop Ecosystem Internals

Hadoop Internals - This post is a quick summary from a learning session.

Data Copy Basics (Writing data to HDFS)
  • Network proximity is considered during data storage (the first 2 IPs are the ones closest to the client)
  • Data is stored in 64MB blocks
  • Data is replicated 3 times by default
  • Client gets an error message when the write operation on the primary node fails
  • Blocks of a file are spread across different machines
  • SSH is used by the cluster start / stop scripts to launch the daemons on the nodes; the running nodes communicate with each other over RPC
  • Client communication through RPC
  • Writing of blocks happens in parallel, replication happens in a pipeline
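
The effect of the block size and the replication factor on one file can be worked out with simple arithmetic. The sketch below assumes the 64MB block size and 3 copies from the notes above; the `block_layout` helper and the 200MB file are made up for illustration.

```python
BLOCK_SIZE_MB = 64   # block size stated in these notes
REPLICATION = 3      # default number of copies


def block_layout(file_size_mb):
    """Return how a file of the given size (> 0 MB) is cut into blocks and replicated."""
    full_blocks, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    blocks = full_blocks + (1 if remainder else 0)
    return {
        "blocks": blocks,
        # The last block only occupies the space it actually needs.
        "last_block_mb": remainder or BLOCK_SIZE_MB,
        "block_replicas": blocks * REPLICATION,
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


print(block_layout(200))
```

A 200MB file becomes 3 full blocks plus one 8MB block, stored as 12 block replicas that take 600MB of raw disk across the cluster.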
Analysis / Reads (Reading Data from HDFS)
  • Client -> Master -> IPs of the nearest nodes holding the data are returned
  • Master knows the utilization of the nodes. It allocates processing to the least-used machine where a copy of the data exists
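
The selection rule described above (least-used machine among those holding a copy) can be written down as a tiny sketch. The cluster state, node names and the `choose_node` function are made up for illustration; a real scheduler also considers rack distance and would update the load after each assignment.

```python
def choose_node(block_replicas, node_load):
    """Pick, for each block, the least-loaded node that holds a replica of it."""
    plan = {}
    for block, nodes in block_replicas.items():
        plan[block] = min(nodes, key=lambda node: node_load[node])
    return plan


# Hypothetical state: block -> nodes holding a copy, node -> current load (0.0 to 1.0).
block_replicas = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
}
node_load = {"node-a": 0.80, "node-b": 0.35, "node-c": 0.60, "node-d": 0.10}

print(choose_node(block_replicas, node_load))
```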
Concepts
  • Namenode - Metadata
  • DataNode - Actual Data
  • chmod 755 - Owner has read, write and execute permission; group and others have read and execute
  • Rack - Physical Set of Machines
  • Node - Individual machine
  • Cluster - Set of Racks
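
The octal digits in a mode such as 755 can be decoded with Python's standard `stat` module. This is just a quick way to check what a mode means; the `describe` helper is made up for illustration.

```python
import stat

mode = 0o755

# S_IFREG marks a regular file so that stat.filemode() can render the full string.
print(stat.filemode(stat.S_IFREG | mode))


def describe(bits):
    """Turn one octal digit (0-7) into its r / w / x letters."""
    return "".join(name for flag, name in ((4, "r"), (2, "w"), (1, "x")) if bits & flag)


owner, group, others = (mode >> 6) & 7, (mode >> 3) & 7, mode & 7
print("owner:", describe(owner), "group:", describe(group), "others:", describe(others))
```

Each digit is the sum of read (4), write (2) and execute (1), so 7 is rwx and 5 is r-x without the write bit.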
Happy Learning!!!

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data that challenges current limits in terms of volumes and speed of data generation

Big Data refers to large data generated from social media, ecommerce sites, mobile devices, sensors etc. The volume of data generated is huge and the rate of data growth is rapid.
Big Data Technology refers to technology that helps to process large volumes of data in an economically viable way using commodity hardware

Hadoop ecosystem is the key for Big Data Technology. Hadoop technologies are inspired by Google’s infrastructure
  • Processing – MapReduce, inspired by Google MapReduce
  • File Storage – HDFS, inspired by Google File System
  • Database – HBase, inspired by Bigtable
  • Pig, Hive - for Analytics

  • Storage Layer is - HDFS
  • Processing Layer - Map Reduce
HDFS (Hadoop Distributed File System) 

  • File System Written in Java
  • High throughput, effective for reading large chunks of data
  • Runs on commodity hardware
  • Fault tolerant (data is replicated; when a copy is not accessible, the data is served from another replica, making it a highly available system)
  • Scalable (Ability to scale - add / remove hardware for file system storage / processing for Map Reduce Jobs)
  • Run on Diverse Environments
HDFS - Split, Scatter, Replicate and manage data across servers
HDFS Internals 
  • Master Slave Architecture based
  • One NameNode and multiple DataNodes
  • NameNode manages data storage, allocation, processing
  • DataNode - Read / write operations are performed upon instruction from the NameNode
  • Status of each DataNode is sent to the NameNode as heartbeats
Map Reduce 
  • Parallel data processing framework
  • Designed to execute jobs in parallel
  • Two Phases Mapper and Reducer
  • Map phase splits the job into tasks and executes them in parallel
  • Reduce phase consolidates the results obtained from the individual tasks
  • Map phase needs to be completed before the reduce phase can start working
  • Computation is performed on raw data, computation is performed where data is available (Data locality)
  • Moving code to where data is located is much cheaper, efficient approach for large data volumes
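
The two phases can be tried out on one machine with the usual word count example. This is a plain Python sketch of the idea, not Hadoop API code; the function names and the two input lines are made up, and the grouping step in the middle stands in for what the framework does between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Consolidate all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "code moves to data"]

# Each line could be mapped on a different node, independently of the others.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

The reduce step needs every value for a key, which is why it can only finish after all map tasks are done.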
Map Reduce - Distributed, Parallel data processing framework


HBase 

  • NoSQL database hosted on top of HDFS
  • Column-oriented database
  • Targeted for random reads and real-time query processing
  • HBase uses HDFS as its data storage layer, which takes care of the fault tolerance and scalability aspects
Hive
  • Targeted for Analytics
  • Natural choice for SQL Developers is Hive 
Pig
  • Scripting language for analytics queries
In the next set of posts we will see HBase, Hive and Pig in detail
 
Happy Learning!!!
  

June 22, 2012

Hadoop Quick Bytes

 
I have reviewed a couple of YouTube sessions on Hadoop basics. Listed below are short one-liners on the fundamentals of the Hadoop framework

Hadoop - Open Source Framework, Targeted for Batch / Offline Data Processing, Data & I/O intensive Applications

HDFS - Split, Scatter, Replicate and Manage Data across nodes

Map Reduce - Divide tasks, Co-locate parts of data, and manage failure across nodes

Map Reduce is a Paradigm shift 
  • Operate on File Splits
  • Operate on one block of file
  • Operate on Key, Value Pair
  • Data is not moved for processing
  • Move code to where data is available
  • Data Locality is the key in Map Reduce Programming Approach
 
HDFS Features
  • Fault Tolerant - When nodes fail, the replicated and distributed data is leveraged to recover the lost data
  • Self-Healing - Rebalances on failure: when a task allocated to a node fails, the job is reallocated to another free node
  • Scalable - New nodes can be added to store data and participate in executing MapReduce jobs

Key Strategy Shift - A MapReduce job is executed where the data is stored. This is in sharp contrast to the traditional ETL process, where data is pulled (delta pull) from production systems, cleansed, and loaded into the target system to refresh data marts.


Happy Learning!!!!
 

June 03, 2012

Hadoop Basics - Part II

[You may also like - Hadoop Basics - Part I]

This next post is about learning Streaming Data Access in HDFS
Summarising from the link
  • HDFS is targeted for batch processing
  • Emphasis is high throughput for Data Reads
This looks OK, but how is it achieved? This question was useful for understanding it - What is meant by “streaming data access” in HDFS?

The answer is very easy to interpret and understand. Please find the answer below; the important lines are underlined.




Since data is stored sequentially and read sequentially, the cost of random reads (seeks) and the time to locate the DataNode are minimal.
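
Why sequential reads of large blocks keep the seek cost small can be shown with rough arithmetic. The numbers below (10 ms per seek, 100 MB/s transfer rate) are assumptions picked for illustration, not measurements, and `read_time_s` is a made-up helper.

```python
import math

SEEK_MS = 10.0              # assumed time for one disk seek
TRANSFER_MB_PER_S = 100.0   # assumed sequential transfer rate


def read_time_s(total_mb, chunk_mb):
    """Time to read total_mb when one seek is paid per chunk of chunk_mb."""
    seeks = math.ceil(total_mb / chunk_mb)
    return seeks * SEEK_MS / 1000 + total_mb / TRANSFER_MB_PER_S


one_gb = 1024
print("64 MB chunks:", read_time_s(one_gb, 64), "seconds")
print("4 KB chunks: ", read_time_s(one_gb, 4 / 1024), "seconds")
```

Under these assumptions, reading 1 GB in 64MB chunks spends well under a second on seeks, while reading the same data in 4KB random reads spends most of its time seeking.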

Another beautiful explanation from Google Research Paper - Google File System


Please feel free to add your comments.
Happy Learning!!!

Hadoop Basics - Part I


This post is to get started with Hadoop basics. These are my notes on one Hadoop feature - Fault Tolerance.

HDFS (Hadoop Distributed File System) 
  • One of the key features is Fault Tolerance
  • Inbuilt capability to handle data failure issues. Multiple copies of the same dataset are managed by the system
  • When a particular dataset is not accessible, the system can replace it with another accessible copy of the same dataset
How this is achieved

HDFS is based on Master - Slave Architecture

Master - NameNode (Manages the Data)
  • Many to 1 relationship between NameNode & DataNode
  • NameNode manages the data - how it is stored in DataNodes and how it is replicated between DataNodes
  • The NameNode receives a Heartbeat and a BlockReport from each DataNode in the cluster
  • NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata
  • EditLog is stored in the NameNode’s local filesystem

Slave - DataNode (Stores the Data)
  • A file is split into one or more blocks and set of blocks are stored in DataNodes (Each file is a sequence of blocks)
  • DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.
  • BlockReport contains all the blocks on a DataNode (source - Link1, Link2)
If there is an issue with data access, the Heartbeat is an indicator of it. In such cases the NameNode identifies the inaccessible DataNode and uses a replicated copy on another DataNode as the alternative.
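
The recovery idea above can be sketched in a few lines of Python. This is a simplified model, not NameNode code: the timeout value, node names and both helper functions are made up for illustration. A node whose last heartbeat is too old is treated as dead, and every block that has lost a copy is reported as needing a new replica.

```python
HEARTBEAT_TIMEOUT_S = 30   # hypothetical timeout, chosen only for illustration
TARGET_REPLICAS = 3        # default replication factor


def find_dead_nodes(last_heartbeat, now):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT_S}


def replicas_to_create(block_locations, dead):
    """Return, per block, how many new copies are needed to reach the target."""
    missing = {}
    for block, nodes in block_locations.items():
        live = set(nodes) - dead
        if len(live) < TARGET_REPLICAS:
            missing[block] = TARGET_REPLICAS - len(live)
    return missing


# Hypothetical state: time (in seconds) of the last heartbeat from each DataNode.
last_heartbeat = {"dn1": 100, "dn2": 98, "dn3": 40, "dn4": 99}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}

dead = find_dead_nodes(last_heartbeat, now=101)
print(dead)
print(replicas_to_create(block_locations, dead))
```

Reads of blk_1 can still be served from dn1 or dn2 while the missing copy is re-created on another live node.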
 
 
Happy Learning!!!