"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

July 31, 2012

Big Data Conference Notes - Part II

In continuation of the previous post.

#4. Fifth Elephant Conference – Cloud Story for Big Data by AWS Evangelist Joe Ziegler


This was a beginner session; there was no in-depth discussion of tools or architecture approach. Amazon is the undisputed cloud leader right now. The underlying tools / techniques are now being applied by competitors such as Azure, VMware Cloud and Google Cloud. Following Google's lead, Amazon harnessed the power of Hadoop and MapReduce (Elastic MapReduce) along with S3 storage, and provides them on the AWS platform.


Data becomes so large that you need to innovate to store and process it. Bigger data is harder data: it comes from multiple sources and in many different formats. By the end of 2012, 2.7 zettabytes of data will have been generated, and 90% of it is unstructured.


Why Cloud?
  • Elastic (Spin up machines on a need basis)
  • Pay per use
  • No Capital investment
  • Faster time to market
  • Focus on Core Complexity
Cloud benefits
  • Reusable – Deploy, take a snapshot in the cloud and use it to deploy again later
  • Managed Services – Managed, hosted Hadoop environment @ Amazon
  • Scale 
  • Innovation
Cloud reduces cost of experimentation

S3 – Simple Storage Service
Overall, Big Data and Cloud are best friends: the cloud provides the infrastructure to host / run big data workloads. AWS offerings and customer case studies were highlighted.
#5. Fifth Elephant Conference – Real-time Analytics @ Flipkart
They explained their custom real-time analytics for supply chain orders, procurements, etc. Both log files and the database are read, and the data is processed and reflected in visual graphs.
A lot of custom tools were developed for automating log collection across servers, along with a custom replication setup (multi-threaded approach).
I have presented a rough architecture. You can find more on these tools by searching the net.
All of them are built on open source frameworks and the Linux platform:
  • DB – MySQL
  • ElasticSearch – Open source text-based indexing (similar to a DB)
  • StatsD – Network daemon tool on Node.js
  • StatsD Layer – For RegEx patterns, Aggregates, Deviations
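To make the "RegEx patterns, aggregates, deviations" idea concrete, here is a minimal sketch in Python. It is only my illustration and not Flipkart's actual code: the log line format, the `LOG_PATTERN` regex, the `orders` metric prefix and the function names are all assumptions. It matches log lines with a regex, counts events per type, computes the mean and standard deviation of a latency field, and formats the counts in the plain-text counter form that StatsD accepts (`name:value|c`).

```python
import re
import statistics
from collections import Counter

# Hypothetical log format; the talk did not show the actual log lines.
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) order_(?P<event>\w+) took=(?P<ms>\d+)ms$")


def summarize(lines):
    """Count events per type and compute latency mean / standard deviation."""
    counts = Counter()
    latencies = []
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # ignore lines that do not match the pattern
        counts[match.group("event")] += 1
        latencies.append(int(match.group("ms")))
    mean = statistics.mean(latencies) if latencies else 0.0
    deviation = statistics.pstdev(latencies) if latencies else 0.0
    return counts, mean, deviation


def to_statsd_lines(counts, prefix="orders"):
    """Format counters in the StatsD plain-text form `name:value|c`."""
    return [f"{prefix}.{event}:{n}|c" for event, n in sorted(counts.items())]


# Made-up sample lines, purely for illustration.
sample = [
    "2012-07-28T10:00:01 order_placed took=120ms",
    "2012-07-28T10:00:02 order_placed took=80ms",
    "2012-07-28T10:00:03 order_shipped took=100ms",
    "not a matching line",
]
counts, mean, deviation = summarize(sample)
metric_lines = to_statsd_lines(counts)
```

In a real setup the formatted lines would be sent over UDP to the StatsD daemon, which does the aggregation itself; the local aggregation here is only to show the idea in one place.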
This is a very interesting approach: looking at both database and application events ensures that data is loaded / monitored on both ends.
It could perhaps be simplified by storing the data in a NoSQL database and querying on top of it.
I am not sure whether this would actually simplify their approach; it again depends on the production scenario / usage. All of this approach / architecture can be implemented using .NET / Java / Python.
Happy Learning!!!
