Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Big Data - Basics

March 04, 2012

Big Data - Basics - Getting Started

[You may also like - NOSQL - can it replace RDBMS Databases?]

This post covers the fundamentals and evolution of big data computing. It is based on a discussion with one of my colleagues and my quest to find out more about big data: where it started, why it is needed, and what the current state of big data computing is.

Why Big Data ?

Distributed data processing, support for massive data (petabytes) and scalability were challenges for traditional BI systems (MSSQL, Oracle and other BI solution providers)
Transaction processing systems based on ACID properties and fixed schema design ran into scalability issues; the need for an alternative led to the evolution of NoSQL databases and Hadoop-based systems
Search engines and social networking sites accumulate large amounts of data in a very short time
Scalability, flexible schema support and indexing support are properties of NoSQL systems
Moving away from traditional ETL-based data processing, which took a lot of time to consolidate various data sources and process large amounts of data

Phases Involved in Traditional BI Processing

MSBI

ETL Processing
Build Data Marts
Build Cubes
Run SSRS reports

Phases Involved in Big Data Processing

Storage can be Hadoop-based or NoSQL-based. It is useful to read up on the evolution of Hadoop.
For detailed BI processing in Hadoop, the presentation "Realtime BI in Hadoop" is useful
Some important metrics for data processing in big data: Yahoo processed 1 TB of data in 16 secs and 1 PB of data in 16 hours (Source - Link - Slide 29)
Hadoop vs RDBMS (Slide 17 of the presentation is good)

How Big Data Evolved

Everything started with Google's MapReduce approach, followed by the evolution of Hadoop by 2006
Yahoo, Facebook, Twitter and other major players opted for Hadoop-based and NoSQL databases
Reference - link was useful

How Map reduce works

In the map step, input data is converted into meaningful key / value pairs; in the reduce step, the values for each key are combined into a smaller result
Since the data is processed (reduced) at the source nodes, there is no need to first load all the raw data onto a central server and process it there
The reduced data, consolidated from the various sources, is used for data analytics / further data processing (data marts etc.)
Post is a very good note in simple terms for learning a Map / Reduce implementation - MapReduce involves distributed data processing with data stored as keys
The word count program mentioned there is available in link
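
The map and reduce steps described above can be illustrated with a small word count. This is a minimal single-process sketch in Python, not the program from the linked post and not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are assumptions made up for illustration. In a real cluster the map step would run on many machines and the framework would do the grouping.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, then reduce each group of values."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, one string per "document".
    docs = ["big data is big", "data about data"]
    print(word_count(docs))
```

Each document is mapped independently, which is what allows the map step to be spread across machines; only the much smaller (word, count) results need to be brought together.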

How Hadoop works

Hadoop is a framework for distributed data processing
It is based on the MapReduce approach
Slide 9 of the presentation is a very good representation of a Hadoop setup
The key components include HBase for storage and Hive, a query language for Hadoop
Other components: Sqoop (imports data from RDBMS systems into Hadoop clusters), Pig, Avro etc.
Good presentation - link
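
One detail behind "distributed data processing based on MapReduce" is how key / value pairs produced on many machines end up at the right reducer. Below is a minimal sketch in Python, assuming simple hash partitioning (hash of the key modulo the number of reducers, the same principle as Hadoop's default partitioner). The function names partition and route, the use of CRC32 as the hash, and the sample pairs are assumptions for illustration and not Hadoop's actual code.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key; the same key always gets the same index."""
    # CRC32 is used here only as a stable stand-in hash (an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Split map output into one bucket of (key, value) pairs per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    # Hypothetical map output collected from several nodes.
    map_output = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1)]
    for index, bucket in enumerate(route(map_output, 3)):
        print(index, bucket)
```

Because every occurrence of a key lands in the same bucket, each reducer can total its keys without talking to the other reducers.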

Summarizing Key points on Hadoop Usage

Suitable for data mining and analytics on unstructured data
Not recommended for systems that depend on RDBMS guarantees - banking, OLTP-based systems, financial systems etc.

How Microsoft & Oracle play with Big Data

Big Data - A Microsoft Tools Approach. Microsoft Support for Hadoop
Oracle has launched its own NoSQL database

Startups in Big Data space

NUODB - Cloud based RDBMS compliant database. Capable of large data processing and a competitive player in big data space
SPIRE - Based on Hadoop and HBASE. Real time scalable database
Rethink DB - Key Value pair based Storage
Emergence of Columnar Database, Vertica Ranked No.1 for Columnar Data

More Reads

Planning to get started with Hadoop, NuoDB in coming weeks....

Another excellent collection of articles from Wikibon:
Big Data: Hadoop, Business Analytics and Beyond
Real-Time Data Management and Analytics Come in Many Flavors
Big Data Market Size and Vendor Revenues
Microsoft is BIG in Big Data

Happy Learning!!!
