"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part III

#6. Fifth Elephant Conference – Big Data Analytics @ InMobi

I would rate this as the best session in the conference. The journey of inmobi in managing growing data analytics and providing analytics @ real-time is impressive.
Gaurav gave a complete walkthrough from using perl to Hadoop, Pig and finally ended up building their own analytics platform on top of Hadoop
Scale of Data @ InMobi

  • 3 billion impressions per day
  • 100 primary dimensions and 300 derived dimensions
  • 50 measures
Data Characteristics
  • Highly Dynamic data and Analytic needs
  • Frequent addition of new dimensions
  • Dynamic query patterns
  • Canned and adhoc reports
  • Different kind of customers (Sales, Analysts, Executives)
  • Canned Reports – Day in and Day out reports without any change
Journey of Analytics @ Inmobi
Beginning of Analytics
  • Initially perl scripts
  • Logs summarised using perl
  • Perl could not handle increasing data volumes (Q2 2010)
Hadoop Adoption
  • Map Reduce jobs written to aggregate logs and populate Database
  • 3 machine Hadoop cluster setup was done
  • Challenging was writing map reduce jobs took a lot of time
  • With Increasing DB Views this was harden to accommodate with custom MR jobs creation
Hadoop, Map Reduce and Pig
  • Pig was adopted; Pig was aggregated logs and pushing data into database
  • For Complex operations custom MR jobs were written
More Analytics, More Data, Growing Measures
  • Analytics was becoming increasing complex
  • DB suffered ‘limited angle view’ problems
  • Hive was not mature when they tried it out. Hive was resource consuming; it was not creating optimal jobs. Data transfer between mapper / reducer was not scalable
  • Pig jobs were written for new requirements for fetching and loading data in DB

He highlighted the challenges in adopting open source frameworks
  • Too much customization and constant fine tuning required
  • Difficult to absorb business changes while trying to customize the platform
  • Different open source framework at different parts of stack, Difficult to integrate and maintain
  • Pig not suited for business users

YODA (InMobi Inhouse Analytics Framework)
  • Complete stack was custom built (ETL, Query Processor, Query builder, Visualization)
  • Built on top of Hadoop
  • SQL Like operations (sum, select, min, max UDF supported)
  • Optimized for storage and queries for data model
  • Protobuf was used for message exchange

Please view the session if you get a chance. It is amazing, Very informative Session

#7. Fifth Elephant Conference – Messaging Architecture @ facebook

Facebook principle is - “Choose best design not implementation But get things done fast”
LSM Trees
  • Stores things in a set of trees
  • High write throughput
  • Recent data clustered
  • Inherently snapshotted
Cassandra Vs HBASE
  • HBase worked out
  • Cassandra (Distributed Database)
  • HDFS – Distributed Storage
#8. Fifth Elephant Conference – Recommendation Engine @ Flipkart
  • Build on top of Hadoop, Cassandra, Redis, Memcache
  • Cassandra for storing logs
  • Map reduce jobs run to identify user browse history, common patterns
  • Identified data stored in redis (key value pair based storage)
  • Caching is done using memcache

Happy Learning!!!

No comments: