This post is primarily a set of notes taken during the Big Data conference, The Fifth Elephant.
#1. Fifth Elephant Conference - Crunching Big Data, Google Scale by Rahul Kulkarni
The first session was 'Crunching Big Data, Google Scale' by Rahul Kulkarni of Google. Captured below are notes from the session.
The session covered Google App Engine, Google Compute Engine, and how Google manages processing of huge data volumes. The two primary concerns around data processing are compute at scale and ad hoc querying over large volumes of data.
Google App Engine
- PaaS (Platform as a Service)
- Stats on data processing volumes – 7.5 billion hits per day and 2 trillion transactions per month
Google Compute Engine
- IaaS (Infrastructure as a Service)
- Targeted at analytics workloads
- Supports deploying your own cluster
- An example of genome processing (large data sets) was shared; GCE reduced computation time for genome processing significantly
- Google whitepapers to check out
- Dremel (2010)
- Dapper (2010) – distributed tracing
- FlumeJava (2010) – data pipelines
- Protocol Buffers (2008)
- Chubby (2006)
I have shared other interesting whitepapers in my earlier posts.
Google's approach for data processing (ad hoc queries) – BigQuery
- Uses column-oriented storage
- Supports MapReduce jobs as well (three phases: map, shuffle, reduce)
- BigQuery supports small joins; for a join, the required (smaller) table's data is moved to where the column data is located
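The three MapReduce phases noted above can be sketched in plain Python. This is a toy single-machine word count for illustration only; a real MapReduce framework distributes each phase across many machines.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) pairs for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: combine the grouped values for one key."""
    return key, sum(values)

lines = ["big data at scale", "data pipelines at google scale"]
mapped = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(mapped)
counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts["scale"])  # -> 2
```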
Google's cloud-based solution for data
- App Engine (front end)
- BigQuery (data processing)
- Cloud Storage (data storage)
Key Learnings
- The Google Cloud Platform can be used for prototypes involving big data
- Columnar databases are gaining market share for analytics (Hadapt, Vertica, etc.)
- I learnt about a bunch of new whitepapers from the session
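A minimal sketch of why column-oriented storage suits analytics (this is illustrative only, not how BigQuery or Dremel are implemented): an aggregate over one field scans just that column's contiguous array, instead of walking every field of every row.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "IN", "revenue": 10.0},
    {"user": "b", "country": "US", "revenue": 25.0},
    {"user": "c", "country": "IN", "revenue": 5.0},
]

# Column-oriented layout: one contiguous array per column.
columns = {
    "user": ["a", "b", "c"],
    "country": ["IN", "US", "IN"],
    "revenue": [10.0, 25.0, 5.0],
}

# Row store: the scan must touch whole records to read one field.
total_from_rows = sum(r["revenue"] for r in rows)

# Column store: the scan reads only the 'revenue' array.
total_from_columns = sum(columns["revenue"])

print(total_from_rows, total_from_columns)  # -> 40.0 40.0
```

The same idea also helps compression, since values within one column tend to be similar.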
#2. Fifth Elephant Conference – In Data We Believe Session Notes
Session by Harish Pillay from Red Hat. It briefly covered big data characteristics, opportunities, and Red Hat's offerings for big data.
What is data? 1s and 0s organized in a manner that provides meaning when interpreted.
Structured data characteristics – schema available, normalized, predictable, known
Unstructured data characteristics – semi-structured data like log files, unorganized, no fixed schema
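The log-file example above can be made concrete: semi-structured data has no fixed schema up front, but a pattern can impose one at read time. The log format below is made up for illustration.

```python
import re

# A hypothetical access-log line: text with structure, but no declared schema.
log_line = '127.0.0.1 - [21/Jul/2012:10:15:32] "GET /index.html" 200 1043'

# A regex imposes a schema at read time ("schema on read").
pattern = re.compile(
    r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] "(?P<req>[^"]+)" (?P<status>\d+) (?P<size>\d+)'
)

match = pattern.match(log_line)
record = match.groupdict()  # now behaves like a structured row
print(record["status"])  # -> 200 (as a string)
```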
Red Hat's offerings for cloud and big data were discussed. Red Hat Enterprise Linux, JBoss, Red Hat Storage, and OpenShift were highlighted.
#3. Fifth Elephant Conference – Hadoop ecosystem overview Session Notes
Session by Vinayak Hegde from InMobi on how they manage big data processing, and the tools and frameworks they rely on for it.
Introductory slides covered data generated in large volumes from mobile devices, social networks, financial systems, tweets, blogs, etc.
He listed a dozen open source projects for the different layers involved in data processing; the Data Stack slide was very good. Listed below are the projects I noted during the session.
The session was full of tools used at each layer. Unfortunately, the presentation was cut short as it exceeded the allowed duration. This tools list is a good starter kit for exploring the ecosystem.
Key Learnings
- Open source tools can be leveraged for custom Hadoop-based cluster setup and management; they are a good place to get started for large-scale Hadoop installations
Happy Learning!!!