Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): My Big Data Notes

June 29, 2012

My Big Data Notes - HDFS, MapReduce

Data - Contains / Represents meaningful information
Big Data - Data whose volume and rate of generation challenge the limits of current systems

Big Data refers to large data sets generated by social media, e-commerce sites, mobile devices, sensors, etc. The volume of data generated is huge, and it grows at a rapid rate.

Big Data Technology refers to technology that processes large volumes of data in an economically viable way using commodity hardware

The Hadoop ecosystem is the key Big Data technology. Hadoop technologies are inspired by Google’s infrastructure

Processing - MapReduce, inspired by Google MapReduce
File Storage - HDFS, inspired by the Google File System
Database - HBase, inspired by Bigtable
Analytics - Pig, Hive

Source - Link

Storage Layer - HDFS
Processing Layer - MapReduce

HDFS (Hadoop Distributed File System)

File system written in Java
High throughput; efficient at reading large chunks of data
Runs on commodity hardware
Fault tolerant (data is replicated across nodes; if one copy becomes inaccessible, the data is served from another replica, making it a highly available system)
Scalable (hardware can be added or removed to scale file system storage and MapReduce processing capacity)
Runs in diverse environments

HDFS - Splits, scatters, replicates and manages data across servers
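
To make "split, scatter, replicate" concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function names, the tiny block size and the round-robin placement are assumptions made for illustration. Real HDFS uses much larger blocks (typically 64 MB or 128 MB) and a rack-aware placement policy.

```python
BLOCK_SIZE = 128   # bytes, for illustration only; real HDFS blocks are far larger
REPLICATION = 3    # replication factor; 3 is the usual HDFS default


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for the real placement policy (hypothetical helper).
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(bytes(300))
    layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    for block_id, holders in layout.items():
        print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing any single node in this sketch still leaves at least two copies of each block.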

HDFS Internals

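Based on a master-slave architecture
One NameNode and multiple DataNodes
NameNode manages the file system metadata and the allocation of data blocks to DataNodes
DataNode - Stores the data blocks; serves read / write requests and creates, deletes and replicates blocks upon instruction from the NameNode
Each DataNode reports its status to the NameNode through periodic heartbeats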
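
The heartbeat idea can be sketched in a few lines of Python. This is only an illustration, not the NameNode's actual logic: the class name, method names and the 10 second timeout are all assumptions, and timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Toy stand-in for the NameNode's view of DataNode liveness (hypothetical)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # node id -> timestamp of the latest heartbeat

    def heartbeat(self, node_id, now):
        """Record that `node_id` reported in at time `now`."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return nodes whose latest heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=10)
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.heartbeat("dn1", now=8)    # dn2 stays silent
    print(monitor.dead_nodes(now=12))  # dn2 has been silent for 12 seconds
```

Once a node is considered dead, the blocks it held would be re-replicated from the surviving copies, which is what keeps the data available.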

MapReduce

Parallel data processing framework
Designed to execute jobs in parallel
Two phases: Map and Reduce
The Map phase splits the input into chunks, and mapper tasks process the chunks in parallel
The Reduce phase consolidates the results produced by the individual mapper tasks
The Map phase needs to complete before the reduce function can start processing
Computation is performed on raw data, on the nodes where the data is stored (data locality)
Moving code to where the data is located is a much cheaper, more efficient approach for large data volumes

MapReduce - Distributed, parallel data processing framework
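
The classic way to see the two phases is word count. The sketch below simulates the flow in a single Python process; it does not use the Hadoop API, and the function names are assumptions chosen for illustration. The grouping step in the middle corresponds to what Hadoop calls shuffle and sort.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: consolidate all counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    # Every line could be mapped on a different node; here we just loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)  # needs the output of every map call
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big hadoop data big"]))
```

Note that `shuffle` needs the complete map output before any total is final, which is why the reduce step waits for the map phase to finish.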

HBase

NoSQL database hosted on top of HDFS
Column-oriented database
Targeted at random reads and real-time query processing
HBase uses HDFS as its data storage layer, which takes care of fault tolerance and scalability
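
To picture the data model, an HBase cell is addressed by row key, column (written as family:qualifier) and timestamp. The toy class below mimics that addressing with nested dictionaries. It is a sketch for intuition only, not the HBase client API: `TinyTable`, `put` and `get` here are hypothetical names, and nothing is persisted to HDFS.

```python
class TinyTable:
    """In-memory imitation of row key -> family:qualifier -> timestamp -> value."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = {}

    def put(self, row_key, column, value, ts):
        """Store one versioned cell; `column` looks like 'info:name'."""
        family, _, _qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key, column):
        """Random read by row key: return the newest version, or None."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]


if __name__ == "__main__":
    table = TinyTable(["info"])
    table.put("user1", "info:name", "Ann", ts=1)
    table.put("user1", "info:name", "Anna", ts=2)
    print(table.get("user1", "info:name"))
```

Looking a row up directly by its key, instead of scanning a whole file, is what makes random reads cheap in this kind of store.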

Hive

Targeted at analytics
Hive is the natural choice for SQL developers, since it offers a SQL-like query language (HiveQL)

Pig

Scripting language for analytics queries

In the next set of posts we will look at HBase, Hive and Pig in detail

More Reads

“Hadoop: Basic Concepts” webinar recording released

OSSCube releases “The Motivation for Hadoop” webinar resources

Happy Learning!!!
