I would rate this as the best session of the conference. InMobi's journey in managing ever-growing data and providing analytics in near real time is impressive.
Scale of Data @ InMobi
- 3 billion impressions per day
- 100 primary dimensions and 300 derived dimensions
- 50 measures
Data Characteristics
- Highly dynamic data and analytics needs
- Frequent addition of new dimensions
- Dynamic query patterns
- Canned and ad hoc reports
- Different kinds of customers (Sales, Analysts, Executives)
- Canned reports – day-in, day-out reports that never change
Journey of Analytics @ InMobi
Beginning of Analytics
- Initially, Perl scripts
- Logs were summarised using Perl (a rough sketch follows this list)
- Perl could not handle the increasing data volumes (Q2 2010)
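The talk did not show the actual scripts, but the idea is easy to sketch. Here is a minimal Python stand-in for what those early Perl summarisers did (the log path and field layout are my assumptions, not InMobi's): a single process scans the raw log and tallies counts, which is exactly the approach that stops working at billions of impressions per day.

```python
from collections import Counter

def summarise_impressions(log_path):
    """Tally impressions per (country, ad_unit) from a delimited log file.

    A single-process pass like this is roughly what the early Perl
    scripts did; it stops scaling once the daily log no longer fits
    one machine's I/O budget.
    """
    counts = Counter()
    with open(log_path) as logs:
        for line in logs:
            # Hypothetical format: timestamp \t country \t ad_unit \t ...
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[(fields[1], fields[2])] += 1
    return counts

if __name__ == "__main__":
    for key, count in summarise_impressions("impressions.log").items():
        print(key, count)
```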
Hadoop Adoption
- MapReduce jobs were written to aggregate logs and populate the database
- A 3-machine Hadoop cluster was set up
- Challenge: writing MapReduce jobs took a lot of time (see the mapper/reducer sketch after this list)
- With the growing number of DB views, this became harder to accommodate through custom MR job creation
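To make the pain concrete, here is what a hand-written aggregation looks like in the Hadoop Streaming style, sketched in Python (the field layout is hypothetical): even a simple group-and-count needs a mapper, a reducer, and careful key handling.

```python
import sys

def mapper():
    """Emit "country \t ad_unit \t 1" for every impression line on stdin."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            print(f"{fields[1]}\t{fields[2]}\t1")

def reducer():
    """Sum counts per key; Hadoop delivers reducer input sorted by key."""
    current_key, total = None, 0
    for line in sys.stdin:
        country, ad_unit, count = line.rstrip("\n").split("\t")
        key = (country, ad_unit)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key[0]}\t{current_key[1]}\t{total}")
            current_key, total = key, 0
        total += int(count)
    if current_key is not None:
        print(f"{current_key[0]}\t{current_key[1]}\t{total}")

if __name__ == "__main__":
    # Run as the map or reduce step of a streaming job.
    mapper() if sys.argv[1] == "map" else reducer()
```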
Hadoop, Map Reduce and Pig
- Pig was adopted; Pig scripts aggregated logs and pushed data into the database (a comparison sketch follows this list)
- For complex operations, custom MR jobs were still written
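For comparison with the mapper/reducer pair above, here is the same aggregation expressed declaratively. This pandas sketch is my own stand-in, not InMobi's code, but it captures why a Pig GROUP/FOREACH/SUM script is so much shorter than a custom MR job.

```python
import pandas as pd

# Hypothetical impression log with two dimensions and one measure.
logs = pd.DataFrame({
    "country": ["IN", "IN", "US", "US", "US"],
    "ad_unit": ["banner", "banner", "banner", "video", "video"],
    "impressions": [3, 2, 5, 1, 4],
})

# One declarative line replaces the mapper/reducer pair above,
# much as a Pig GROUP ... FOREACH ... SUM script replaced custom MR jobs.
summary = logs.groupby(["country", "ad_unit"])["impressions"].sum()
print(summary)
```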
More Analytics, More Data, Growing Measures
- Analytics was becoming increasingly complex
- The DB suffered from ‘limited angle view’ problems
- Hive was not mature when they tried it out: it was resource-hungry, it did not generate optimal jobs, and data transfer between mapper and reducer did not scale
- Pig jobs were written for every new requirement to fetch and load data into the DB
Realisation
The speaker highlighted the challenges of adopting open-source frameworks:
- Too much customization and constant fine-tuning required
- Difficult to absorb business changes while trying to customize the platform
- Different open-source frameworks at different parts of the stack, difficult to integrate and maintain
- Pig not suited for business users
YODA (InMobi Inhouse Analytics Framework)
- Complete stack was custom built (ETL, Query Processor, Query builder, Visualization)
- Built on top of Hadoop
- SQL-like operations (sum, select, min, max; UDFs supported – a hypothetical sketch follows this list)
- Storage and queries optimized for the data model
- Protobuf was used for message exchange
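The talk listed the operations but not the internals, so the following is a purely hypothetical Python sketch of what a SQL-like layer with pluggable UDFs could look like; every name in it is mine, not YODA's.

```python
# Hypothetical sketch of the kind of SQL-like layer YODA exposes:
# built-in aggregates plus user-registered UDFs over rows.
# None of these names come from the talk.
AGGREGATES = {"sum": sum, "min": min, "max": max}

def register_udf(name, fn):
    """Let users plug in a custom aggregate function."""
    AGGREGATES[name] = fn

def query(rows, measure, agg, group_by):
    """SELECT group_by, AGG(measure) FROM rows GROUP BY group_by."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_by], []).append(row[measure])
    return {key: AGGREGATES[agg](values) for key, values in groups.items()}

rows = [
    {"country": "IN", "clicks": 10},
    {"country": "IN", "clicks": 4},
    {"country": "US", "clicks": 7},
]
register_udf("avg", lambda vals: sum(vals) / len(vals))
print(query(rows, "clicks", "sum", "country"))   # {'IN': 14, 'US': 7}
print(query(rows, "clicks", "avg", "country"))   # {'IN': 7.0, 'US': 7.0}
```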
Please view the session if you get a chance; it is an amazing, very informative session.
#7. Fifth Elephant Conference – Messaging Architecture @ Facebook
- 10 TB messages / Day
- Haystack is used for storing Facebook photos
- Needle in a haystack: efficient storage of billions of photos
Facebook's principle is: “Choose the best design, not implementation, but get things done fast”
LSM Trees
- Stores data in a set of trees
- High write throughput
- Recent data clustered
- Inherently snapshotted (a toy sketch follows this list)
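As a toy illustration of those properties (my own sketch, not Facebook's code): writes land in an in-memory memtable, full memtables are flushed as immutable sorted runs (recent data stays clustered, old runs are natural snapshots), and reads check the newest data first. Compaction, which a real LSM store needs, is omitted here.

```python
import bisect

class TinyLSM:
    """Toy LSM store: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []            # newest run last; each run is sorted
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value        # absorb writes in memory
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}            # flushed run is now immutable

    def get(self, key):
        if key in self.memtable:          # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then runs, newest to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

store = TinyLSM()
for i in range(10):
    store.put(f"msg:{i}", f"body {i}")
print(store.get("msg:3"))  # found in a flushed run
```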
Cassandra vs HBase
- HBase won out
- Cassandra (Distributed Database)
- HDFS – Distributed Storage
- Built on top of Hadoop, Cassandra, Redis and Memcache
- Cassandra for storing logs
- MapReduce jobs run to identify user browse history and common patterns
- The identified data is stored in Redis (key-value store)
- Caching is done using Memcache (a read-through sketch follows this list)
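The flow from the MR jobs through Redis to Memcache amounts to a read-through cache over derived data. The sketch below is an assumed illustration of that flow, with plain dicts standing in for the Redis and Memcache tiers so it runs anywhere; it is not Facebook's actual code.

```python
# Dicts stand in for Redis (derived key-value data) and Memcache
# (the cache tier); this is an assumed flow, not Facebook's code.
redis_store = {}   # populated by the MR jobs with browse patterns
cache = {}         # memcache tier in front of it

def store_pattern(user_id, patterns):
    """What an MR job's output step might do: persist derived data."""
    redis_store[f"patterns:{user_id}"] = patterns

def get_patterns(user_id):
    """Read-through: serve from cache, fall back to the key-value store."""
    key = f"patterns:{user_id}"
    if key in cache:
        return cache[key]              # cache hit
    value = redis_store.get(key)       # miss: fetch from the store
    if value is not None:
        cache[key] = value             # populate cache for next time
    return value

store_pattern("u42", ["sports", "tech"])
print(get_patterns("u42"))  # miss, fetched from the store and cached
print(get_patterns("u42"))  # hit, served from the cache
```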
Happy Learning!!!