
March 04, 2012

Big Data - Basics - Getting Started

[You may also like - NOSQL - can it replace RDBMS Databases?]

This post covers the fundamentals and evolution of big data computing. It grew out of a discussion with one of my colleagues and the quest to find the details of big data: where did it start, why is it needed, and what is the current state of big data computing?

Why Big Data ?
  • Distributed data processing, support for massive data volumes (petabytes), and scalability were challenges in traditional BI systems (MSSQL, Oracle, and other BI solution providers)
  • The limitations of transaction processing systems (ACID properties, fixed schema design, scalability issues) led to the evolution of NoSQL databases and Hadoop-based systems
  • Search engines and social networking sites accumulate large amounts of data in a very short time
  • Scalability, flexible schema support, and indexing support are core properties of NoSQL systems
  • The industry is moving away from traditional ETL-based data processing, which took a lot of time to consolidate various data sources and process large amounts of data
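The flexible-schema property mentioned above can be sketched in a few lines of Python (the records and field names here are hypothetical, purely for illustration): in a document-style NoSQL store, each record carries its own fields, unlike a fixed RDBMS table.

```python
# Hypothetical "documents" illustrating flexible schemas: unlike rows in a
# fixed RDBMS table, each record can carry a different set of fields.
users = [
    {"id": 1, "name": "alice", "email": "alice@example.com"},
    {"id": 2, "name": "bob", "followers": 120},            # no email field
    {"id": 3, "name": "carol", "tags": ["bigdata", "hadoop"]},
]

# Queries must tolerate missing fields instead of relying on a schema.
with_email = [u["name"] for u in users if "email" in u]
print(with_email)  # ['alice']
```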
Phases Involved in Traditional BI Processing
  • ETL Processing
  • Build Data Marts
  • Build Cubes
  • Run SSRS reports 
Phases Involved in Big Data Processing
  • Storage can be Hadoop-based or NoSQL-based. It is useful to check the evolution of Hadoop.
  • Detailed BI processing is done in Hadoop. The presentation on real-time BI in Hadoop is useful
  • Some important benchmarks for big data processing: Yahoo processed 1 TB of data in 16 seconds and 1 PB in 16 hours (Source - Link - Slide 29)
  • Hadoop vs. RDBMS (slide 17 of the presentation is a good comparison)
How Big Data Evolved
  • Everything started with Google's MapReduce approach, followed by the evolution of Hadoop around 2006
  • Yahoo, Facebook, Twitter, and other major players opted for Hadoop-based databases and NoSQL databases
  • Reference - the link was useful
How Map reduce works
  • Input data is mapped into meaningful key/value pairs, which are then reduced (aggregated) by key
  • Since data is processed (reduced) close to where it is stored, there is no need to load all the raw data and process it at a single central server
  • This reduced data, consolidated from various sources, is used for data analytics and further data processing (data marts, etc.)
  • The post is a very good note, in simple terms, for learning the MapReduce implementation - MapReduce combines distributed data processing with data stored as keys
  • The word count program mentioned is available at the link
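The word count program mentioned above can be sketched in plain Python to show the map and reduce phases (this is an illustration of the MapReduce idea, not actual Hadoop code):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) key/value pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: group pairs by key (the shuffle/sort step) and sum counts."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big hadoop", "hadoop big"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real Hadoop cluster the map tasks run in parallel on the nodes holding the data, and the framework handles the sort/shuffle between the two phases.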
How Hadoop works
  • Hadoop is a Framework for Distributed Data Processing
  • Based on Map Reduce Approach
  • Slide 9 of the presentation is a very good representation of a Hadoop setup.
  • The key components include HBase for storage and Hive, an SQL-like query language for Hadoop
  • Sqoop imports data from RDBMS systems into Hadoop clusters; other tools include Pig, Avro, etc.
  • Good Presentation - link
Summarizing Key points on Hadoop Usage
  • Suitable for data mining and analytics over unstructured data
  • Not recommended for systems that need RDBMS guarantees - banking, OLTP-based systems, financial systems, etc.
How Microsoft & Oracle play with Big Data
Startups in Big Data space

  • NuoDB - a cloud-based, RDBMS-compliant database capable of large-scale data processing and a competitive player in the big data space
  • Spire - based on Hadoop and HBase; a real-time, scalable database
  • RethinkDB - key/value pair based storage
  • Emergence of columnar databases; Vertica is ranked No. 1 for columnar data
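The key/value storage model mentioned above can be illustrated with a toy in-memory sketch (this is not any particular product's API, just the model):

```python
class KVStore:
    """Toy in-memory key/value store illustrating the storage model
    used by key/value NoSQL databases (not a real product's API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are opaque to the store; any object can be stored.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:1", {"name": "alice"})
print(store.get("user:1"))  # {'name': 'alice'}
```

Lookups go by key only; there is no schema and no SQL, which is what makes this model easy to partition and scale horizontally.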
More Reads 
Planning to get started with Hadoop and NuoDB in the coming weeks....

Another excellent article collection from Wikibon:
Big Data: Hadoop, Business Analytics and Beyond
Real-Time Data Management and Analytics Come in Many Flavors
Big Data Market Size and Vendor Revenues
Microsoft is BIG in Big Data

Happy Learning!!!

