Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Hadoop Basics

Showing posts with label Hadoop Basics. Show all posts

October 08, 2014

Hive Overview Notes

Data Warehousing package built on top of hadoop
Managing and querying structured data
Apache Derby embedded DB used by Hive
metastore_db folder for persistence of data
Suitable for WORM - Write Once Read Many Times Access Pattern
Core Components are Shell, Metastore, Execution Engine, Compiler (Parse, Plan, Optimize), Driver
Tables can be created as Internal Tables, External Table (Pointing to external file)
When Internal Tables are dropped schema + data is dropped. For external referencing tables only Schema is dropped not data. Both Internal and External tables reside in HDFS
Data files for created tables would be available in location /user/hive/warehouse
Partitioning in Hive - Hash Value % Number of buckets - that particular row will go into that bucket
Partition table should always be an Internal Hive Table

Happy Learning!!!

October 07, 2014

Map Reduce Internals

Client Submits Job. Job Tracker does the splitting, scheduling Job

Mapper

Mapper runs the business logic (ex- word counting)
Mapper (Maps your need from the record)
Record reader provides input to mapper in key value format
Mapper Side Join (Distributed Caching)
Output of mapper (list of keys and values). Output of mapper function stored in Sequence file
Framework does splitting based on input format, Default is new line (text format)
Every row / Record will go through map function
When there is a data split (row) is split between two 64MB Blocks. That particular row would be merged for complete record and processed
Default block size in Hadoop 2.0 is 128MB

Reducer

Reducer will poll it, job tracker will inform what all nodes to poll
Default number of reducer is 1. This is configurable
Multiple Reducers - Not possible - Multiple level MR jobs possible
Reduce Side join (Join @ Reducer Level)

Combiner

Combiner - Mini Reducer, Combiner before writing to disk, finds max value from data
Combiner is used when map job itself can do some preprocessing to minimize reducer workload

Partitioner

Hash Partitioner is default partitioner
Mapper -> Combiner -> Partitioner -> Reducer (For multi-dimension, 2012-max sales by product, 2013, max sales by location)

Happy Learning!!!

October 06, 2014

Hadoop Ecosystem Internals

Hadoop Internals - This post is quick summary from learning session.

Data Copy Basics (Writing data to HDFS)

Network Proximity during Data Storage (First 2 Ips closest to client)
Data Storage size in 64MB Blocks
Data Replication Copy by default 3
Client gets error message when Primary Node Data Write Operation Errors
Blocks will be horizontally split on different machines
Slave uses SSH to connect to master (Communication between Nodes also SSH)
Client communication through RPC
Writing happens parallel-y, replication happens in a pipeline

Analysis / Reads (Reading Data from HDFS)

Client -> Master -> Nearest Ips returned for Nodes
Master knows performance utilization of nodes, It would allocate machine which is least used for Processing (Where data Copy exists)

Concepts

Namenode - Metadata
DataNode - Actual Data
chmod 755 - Owner Write permission, others read and execute
Rack - Physical Set of Machines
Node - Individual machine
Cluster - Set of Racks

Learning Resources

hadoopilluminated
Different NOSQL DBs Comparisions
Apache Knox, Apache Sentry - Security Frameworks for Hadoop
HiBench - Hadoop Benchmarking suite

Tools

Happy Learning!!!

November 06, 2012

Hadoop Learning questions from Quora

Quora has a good collection of questions on Hadoop ecosystem, architecture and beginner level questions. Bookmarking some of interesting questions

Recent Updates

More Reads

Happy Learning!!!

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data that challenges current limits in terms of volumes and speed of data generation

Big Data refers to large data generated from social media, ecommerce sites, data from mobile devices, sensors etc. Data volume generated is huge and the rapid rate of data growth

Big Data Technology refer to technology that helps to process large volumes of data in an economically viable way with latest technologies using commodity hardware

Hadoop ecosystem is the key for Big Data Technology. Hadoop technologies are inspired by Google’s infrastructure

Processing – Mapreduce inspired from Google Map Reduce
File Storage – HDFS inspired from Google File System
Database – HBase - Inspired from Big Table
Pig, Hive - for Analytics

Source - Link

Storage Layer is - HDFS
Processing Layer - Map Reduce

HDFS (Hadoop Distributed File System)

File System Written in Java
High throughout, Effective to read large chunks of data
Runs on Commodity hardware
Fault Tolerant ( Data is replicated, Data not accessible is made available with backups available from it, making it highly available system)
Scalable (Ability to scale - add / remove hardware for file system storage / processing for Map Reduce Jobs)
Run on Diverse Environments

HDFS - Split, Scatter, Replicate and manage data across servers

HDFS Internals

Master Slave Architecture based
One NameNode and multipel DataNodes
NameNode manages data storage, allocation, processing
DataNode - Read / Write Operations performed upon instruction from NameNodes
Status of DataNode is sent as HeartBeats

Map Reduce

Parallel data processing framework
Designed to execute jobs in parallel
Two Phases Mapper and Reducer
Mapper splits jobs into parallel jobs and executes them in parallel
Reduce consolidates the results obtained from individual jobs
Mapper phase need to be completed for Reduce job to start working
Computation is performed on raw data, computation is performed where data is available (Data locality)
Moving code to where data is located is much cheaper, efficient approach for large data volumes

Map Reduce - Distributed, Parallel data processing framework

HBase

NOSQL database hosted on top of HDFS
Columnar based Database
Targeted for Random Reads, Real time query processing
HBase uses HDFS as its data storage layer, This takes care of fault tolerance, scalability aspects

Hive

Targeted for Analytics
Natural choice for SQL Developers is Hive

Pig

Scripting language for analytics queries

In next set of posts we will see in detail about Hbase, Hive and Pig

More Reads

Hadoop: Basic Concepts” webinar recording released

OSSCube releases “The Motivation for Hadoop” webinar resources

Happy Learning!!!

June 22, 2012

Hadoop Quick Bytes

Hadoop Quick Bytes

Have reviewed couple of youtube sessions on Hadoop Basics. Listed below are short one liners and fundamentals of Hadoop framework

Hadoop - Open Source Framework, Targeted for Batch / Offline Data Processing, Data & I/O intensive Applications

HDFS - Split, Scatter, Replicate and Manage Data across nodes

Map Reduce - Divide tasks, Co-locate parts of data, and manage failure across nodes

Map Reduce is a Paradigm shift

Operate on File Splits
Operate on one block of file
Operate on Key, Value Pair
Processing is not to move data
Move code to where data is available
Data Locality is the key in Map Reduce Programming Approach

HDFS Features

Fault Tolerant - When Nodes Fail, Replicated & Data Distributed is leveraged to recover lost data
Self-Healing - Rebalance Fail, When a task allocated to a node fails, Job is reallocated to another free node
Scalable - Ability to store data in new nodes and participate in executing map reduce jobs

Key Strategy Shift - Map Reduce Job is executed where the data is stored. This is in sharp contrast to traditional ETL process where data is loaded (Delta pull) from production systems, perform data cleansing and loading it for target system for refreshing data marts.

Happy Learning!!!!

June 12, 2012

Hadoop Basics - Part III

Three google research papers to understand MapReduce, BigData evolution

More Reads
Google: Cluster Computing and MapReduce

Happy Learning!!!

June 03, 2012

Hadoop Basics - Part II

[You may also like - Hadoop Basics - Part I]

Next Post is learning about Streaming Data Access in HDFS
Summarising from the link

HDFS is targeted for batch processing
Emphasis is high throughput for Data Reads

This looks ok but how it is achieved ? This question was useful to understand it - What is meant by “streaming data access” in HDFS?

The answer is very easy to interpret and understand. Please find below answer and underlined is important lines from the answer.

Since Data Stored Sequentials and Read Sequentially. Cost of Random Reads, Time to locate Data Node is minimal.

Another beautiful explanation from Google Research Paper - Google File System

Please feel free to add your comments.

Happy Learning!!!

Hadoop Basics - Part I

This post is to get started with Hadoop basics. This post is my notes on Hadoop feature - Fault Tolerance.

HDFS (Hadoop Distributed File System)

One of the key features is Fault Tolerance
Inbuilt capability to handle data failure issues. Multiple copies of same dataset is managed by the system
Once a particular dataset is not accessible system can replace with another accessible copy of same data set

How this is achieved

HDFS is based on Master - Slave Architecture

Master - NameNode (Manages the Data)

Many to 1 relationship between NameNode & DataNode
NameNode manages the Data - How it is stored in DataNodes, How Data is Replicated between DataNodes is managed by NameNode
The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster
Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem meta data
EditLog is stored in the Namenode’s local filesystem

Slave - DataNode (Stores the Data)

A file is split into one or more blocks and set of blocks are stored in DataNodes (Each file is a sequence of blocks)
DataNodes: serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode.
BlockReport contains all the blocks on a Datanode (source - Link1, Link2)

If there is an issue with DataAccess Heartbeat would be a indicator for Data Issues. In such cases NameNode identifies and replaces with replicated DataNode copy available to be used as alternative for inaccessible DataNode.

Happy Learning!!!

October 08, 2014

October 07, 2014

October 06, 2014

November 06, 2012

June 29, 2012

June 22, 2012

June 12, 2012

June 03, 2012

About Me

What is your Expertise

Search This Blog

Git Code Repository

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts