This post is about learning the fundamentals and evolution of big data computing. It grew out of a discussion with one of my colleagues and a quest to find out the details of big data: where it started, why it is needed, and what the current state of big data computing is.
Why Big Data?
- Distributed data processing, support for massive data volumes (petabytes), and scalability were challenges for traditional BI systems (MSSQL, Oracle and other BI solution providers)
- Transaction processing systems built on ACID properties and fixed schema design ran into scalability issues, which led to the evolution of alternatives: NoSQL databases and Hadoop-based systems
- Search engines and social networking sites accumulate large amounts of data in a very short time
- Scalability, flexible schema support, and indexing support are properties of NoSQL systems
- A move away from traditional ETL-based data processing, which took a lot of time to consolidate various data sources and process large amounts of data
Phases Involved in Traditional BI Processing
- ETL Processing
- Build Data Marts
- Build Cubes
- Run SSRS reports
Phases Involved in Big Data Processing
- Storage can be Hadoop-based or NoSQL-based. It is useful to read up on the evolution of Hadoop.
- Detailed BI processing in Hadoop. The presentation Realtime BI in Hadoop is useful
- Some important metrics for data processing in big data: Yahoo processed 1 TB of data in 16 seconds and 1 PB of data in 16 hours (Source - Link - Slide 29)
- Hadoop vs RDBMS (slide 17 of the presentation is good)
- Everything started with Google's Map/Reduce approach, followed by the evolution of Hadoop by 2006
- Yahoo, Facebook, Twitter and other major players opted for Hadoop-based databases and NoSQL databases
- Reference - link was useful
- Input data is converted (reduced) into meaningful key/value pairs
- Since the data from the source is already processed (reduced), there is no need to load the raw data and process it at the server level
- This reduced data, consolidated from various sources, is used for data analytics or further data processing (data marts etc.)
- Post is a very good note, in simple terms, for learning a Map/Reduce implementation. Map/Reduce involves distributed data processing and data stored as keys
- The word count program mentioned is available in link
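To make the key/value idea above concrete, here is a minimal word count sketch in plain Python. It is not the program from the linked post and it does not use Hadoop's API. It runs in a single process and only imitates the three steps (map, group by key, reduce) that a Map/Reduce framework would spread across many machines. The function names map_phase, shuffle, reduce_phase and word_count are my own, chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return (key, sum(values))


def word_count(documents):
    """Run map, shuffle and reduce over a list of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs big storage", "hadoop stores big data"]
    print(word_count(docs))
```

In a real cluster each document (or block of a file) would be mapped on the node that stores it, and only the small key/value pairs would travel over the network to the reducers.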
How Hadoop works
- Hadoop is a Framework for Distributed Data Processing
- Based on Map Reduce Approach
- Slide 9 of the presentation is a very good representation of a Hadoop setup.
- The key components include HBase for storage and Hive, a SQL-like query language for Hadoop
- Sqoop imports data from RDBMS systems into Hadoop clusters. Other components include Pig, Avro etc.
- Good Presentation - link
Summarizing Key points on Hadoop Usage
- Suitable for data mining and analytics on unstructured data
- Not recommended for systems that need RDBMS compliance: banking, OLTP-based systems, financial systems etc.
How Microsoft & Oracle play with Big Data
- Big Data - A Microsoft Tools Approach. Microsoft Support for Hadoop
- Oracle has launched its own NoSQL database
Startups in Big Data space
- NuoDB - Cloud-based, RDBMS-compliant database. Capable of large-scale data processing and a competitive player in the big data space
- Spire - Based on Hadoop and HBase. Real-time scalable database
- RethinkDB - Key/value pair based storage
- Emergence of columnar databases. Vertica ranked No. 1 for columnar data
More Reads
- NEW! Wiki launched for Apache Hadoop on Windows Azure
- Big data cloud platforms compared (Very Good Comparison)
- NoSQL and Big Data Analytics Roundtable Takeaways
- Big Data: Big Pain or Big Profits?
- The Right Tool for the Job: Using Hadoop with Vertica for Big Data Analytics
- Hadoop VS MSBI
- Misconceptions about Big Data
- Hadoop World 2010: HBase in Production at Facebook
- Data Processing with Hadoop: Scalable and Cost Effective, Doug Cutting, Apache Hadoop Co-founder
- Facebook Architecture
- Hadoop Development at Facebook
- Rethinking Database with Hadoop and Hive
- Hadoop Training: Programming with Hadoop
- Hadoop Training: MapReduce Algorithms
- Cassandra By Example
- NoSQL Ecosystem
- MapR Academy, Free Hadoop Training
- Secrets Revealed in Columnar Database Technology
- The Art of Big Data
- A Compendium of solutions for scaling a Data Store
- Update: Microsoft, Hadoop and Big Data
Another excellent collection of articles from Wikibon
Big Data: Hadoop, Business Analytics and Beyond
Real-Time Data Management and Analytics Come in Many Flavors
Big Data Market Size and Vendor Revenues
Microsoft is BIG in Big Data
Happy Learning!!!