I would rate this as the best session of the conference. InMobi's journey of managing ever-growing data analytics and providing real-time analytics is impressive.
Scale of Data @ InMobi
- 3 billion impressions per day
- 100 primary dimensions and 300 derived dimensions
- 50 measures
Data Characteristics
- Highly dynamic data and analytics needs
- Frequent addition of new dimensions
- Dynamic query patterns
- Canned and ad hoc reports
- Different kinds of customers (sales, analysts, executives)
- Canned reports – day-in, day-out reports that run without any change
Journey of Analytics @ InMobi
Beginning of Analytics
- Initially, Perl scripts
- Logs were summarised using Perl
- Perl could not handle the increasing data volumes (Q2 2010)
Hadoop Adoption
- MapReduce jobs were written to aggregate logs and populate the database (see the sketch after this list)
- A 3-machine Hadoop cluster was set up
- The challenge was that writing MapReduce jobs took a lot of time
- With an increasing number of DB views, this became harder to accommodate through custom MR job creation
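The talk did not show code, but the kind of log-aggregation MapReduce job described can be sketched with Hadoop Streaming. The log format, field positions, and the choice of counting impressions per country are my assumptions for illustration, not InMobi's actual schema:

```python
#!/usr/bin/env python
# mapper.py - minimal Hadoop Streaming mapper. The log format (tab-separated,
# country in the second field) is a hypothetical example.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue              # skip malformed records
    country = fields[1]       # assumed position of the dimension to aggregate
    print("%s\t1" % country)  # emit one count per impression
```

```python
#!/usr/bin/env python
# reducer.py - sums counts per key. Hadoop Streaming delivers keys sorted,
# so a running total can be flushed whenever the key changes.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```

With scripts named as above, a typical invocation would look like `hadoop jar hadoop-streaming.jar -input logs/ -output agg/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the streaming jar path varies by Hadoop version).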
Hadoop, MapReduce, and Pig
- Pig was adopted; Pig aggregated logs and pushed data into the database (see the loader sketch after this list)
- For complex operations, custom MR jobs were still written
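To make the "pushing data into the database" step concrete, here is a minimal loader sketch: it reads the tab-separated aggregate files a Pig or MR job leaves behind and bulk-inserts them into a relational table. sqlite3, the table name, and the output layout are assumptions for illustration, not what InMobi actually used:

```python
#!/usr/bin/env python
# load_aggregates.py - hypothetical loader: reads tab-separated aggregate
# output (key \t count) produced by a Pig/MR job and bulk-inserts it into a
# relational table. sqlite3 stands in for whatever database was actually used.
import glob
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS impressions_by_country "
    "(country TEXT PRIMARY KEY, impressions INTEGER)"
)

rows = []
for path in glob.glob("output/part-*"):   # assumed Hadoop output layout
    with open(path) as f:
        for line in f:
            key, count = line.rstrip("\n").split("\t")
            rows.append((key, int(count)))

conn.executemany(
    "INSERT OR REPLACE INTO impressions_by_country VALUES (?, ?)", rows
)
conn.commit()
conn.close()
```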
More Analytics, More Data, Growing Measures
- Analytics was becoming increasingly complex
- The DB suffered from 'limited angle view' problems (pre-built views covered only a narrow set of query patterns)
- Hive was not mature when they tried it out: it was resource-hungry, did not create optimal jobs, and data transfer between mappers and reducers was not scalable
- Pig jobs were written for new requirements to fetch and load data into the DB
Realisation
He highlighted the challenges of adopting open-source frameworks:
- Too much customization and constant fine-tuning were required
- It was difficult to absorb business changes while customizing the platform
- Different open-source frameworks at different parts of the stack were difficult to integrate and maintain
- Pig is not suited for business users
YODA (InMobi's In-house Analytics Framework)
- The complete stack was custom-built (ETL, query processor, query builder, visualization)
- Built on top of Hadoop
- SQL-like operations (sum, select, min, max; UDFs supported) – see the sketch after this list
- Storage and queries optimized for the data model
- Protobuf was used for message exchange
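YODA's internals were not shown in the session, but the "SQL-like operations with UDF support" idea can be sketched as a tiny aggregation layer with a pluggable function registry. All the names here (register_udf, select, the row format) are hypothetical illustrations, not YODA's actual API:

```python
# query_ops.py - minimal sketch of SQL-like aggregation with pluggable UDFs,
# in the spirit of the sum/select/min/max/UDF support described for YODA.
# The registry and row format are hypothetical.

AGGREGATES = {"sum": sum, "min": min, "max": max}

def register_udf(name, fn):
    """Register a user-defined aggregate under a SQL-callable name."""
    AGGREGATES[name] = fn

def select(rows, dimension, measure, agg="sum"):
    """SELECT dimension, AGG(measure) FROM rows GROUP BY dimension."""
    groups = {}
    for row in rows:
        groups.setdefault(row[dimension], []).append(row[measure])
    return {key: AGGREGATES[agg](values) for key, values in groups.items()}

# Example: a UDF and a query over toy impression records.
register_udf("avg", lambda values: sum(values) / len(values))
rows = [
    {"country": "IN", "clicks": 3},
    {"country": "IN", "clicks": 5},
    {"country": "US", "clicks": 2},
]
print(select(rows, "country", "clicks", agg="avg"))  # {'IN': 4.0, 'US': 2.0}
```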
Please view the session if you get a chance; it is an amazing and very informative session.
#7. Fifth Elephant Conference – Messaging Architecture @ Facebook
- 10 TB of messages per day
- Haystack is used for storing Facebook photos (see the toy sketch after this list)
- "Needle in a haystack: efficient storage of billions of photos"
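The core Haystack idea (per the title above) is to append photos as "needles" into one large store file and keep an in-memory index of offsets, so reading a photo costs a single seek rather than per-file metadata lookups. Below is a toy sketch of that idea; the on-disk format and class names are mine, not Facebook's:

```python
# haystack_toy.py - toy illustration of the Haystack idea: append each photo
# ("needle") to one large store file and keep an in-memory index of
# photo_id -> (offset, size). Format and names are hypothetical.
import os
import struct

class HaystackStore:
    HEADER = struct.Struct("<QI")  # photo_id (8 bytes), size (4 bytes)

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}            # photo_id -> (offset, size)
        self._rebuild_index()

    def _rebuild_index(self):
        """Scan the store once at startup to rebuild the in-memory index."""
        self.f.seek(0)
        while True:
            header = self.f.read(self.HEADER.size)
            if len(header) < self.HEADER.size:
                break
            photo_id, size = self.HEADER.unpack(header)
            self.index[photo_id] = (self.f.tell(), size)
            self.f.seek(size, os.SEEK_CUR)

    def put(self, photo_id, data):
        self.f.seek(0, os.SEEK_END)
        self.f.write(self.HEADER.pack(photo_id, len(data)))
        offset = self.f.tell()
        self.f.write(data)
        self.f.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]   # one seek, no metadata I/O
        self.f.seek(offset)
        return self.f.read(size)

store = HaystackStore("photos.dat")
store.put(42, b"...jpeg bytes...")
assert store.get(42) == b"...jpeg bytes..."
```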
Facebook's principle is: "Choose best design, not implementation, but get things done fast"
LSM Trees
- Stores data in a set of trees (see the sketch after this list)
- High write throughput
- Recent data clustered
- Inherently snapshotted
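A minimal sketch of those LSM properties, assuming a dict-based memtable and immutable sorted runs (this illustrates the behaviour described in the talk, not HBase's actual implementation):

```python
# lsm_toy.py - minimal illustration of LSM-tree behaviour: writes hit a small
# memtable; full memtables are frozen into immutable sorted runs; reads check
# the memtable first, then runs newest-to-oldest.

class LSMTree:
    def __init__(self, memtable_limit=4):
        self.memtable = {}        # mutable, absorbs all writes (high throughput)
        self.runs = []            # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run ("inherently
        # snapshotted": old runs are never modified, only superseded).
        run = dict(sorted(self.memtable.items()))
        self.runs.insert(0, run)  # newest run first -> recent data clustered
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:     # newest-to-oldest: first hit wins
            if key in run:
                return run[key]
        return None

tree = LSMTree()
for i in range(10):
    tree.put("k%d" % i, i)
print(tree.get("k3"), tree.get("k9"))   # 3 9
```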
Cassandra vs. HBase
- HBase worked out (it was chosen)
- Cassandra (distributed database)
- HDFS – distributed storage
- Built on top of Hadoop, Cassandra, Redis, and Memcache
- Cassandra for storing logs
- MapReduce jobs run to identify user browse history and common patterns
- The identified data is stored in Redis (key-value-pair-based storage)
- Caching is done using Memcache (see the read-path sketch after this list)
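Putting the last three bullets together, the read path can be sketched as: precomputed results live in Redis, with Memcache in front as a cache. This uses the redis-py and pymemcache client libraries; the key scheme and the TTL are assumptions for illustration:

```python
# read_path.py - sketch of the read path described above: precomputed results
# live in Redis (key-value store); Memcache sits in front as a cache.
import redis
from pymemcache.client.base import Client as MemcacheClient

r = redis.Redis(host="localhost", port=6379)
mc = MemcacheClient(("localhost", 11211))

def get_browse_history(user_id):
    key = "browse_history:%s" % user_id    # hypothetical key scheme
    cached = mc.get(key)
    if cached is not None:
        return cached                      # cache hit: served from Memcache
    value = r.get(key)                     # miss: fall back to Redis
    if value is not None:
        mc.set(key, value, expire=300)     # populate the cache for 5 minutes
    return value
```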
Happy Learning!!!