"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 01, 2016

World of Data Science

My second-semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read and re-read the same kinds of definitions and introductory articles on data science, but Prof. Manish Singh's session offered a whole new analogy and interesting examples to relate to.

For big data, I have always referred back to the 4 Vs: Volume, Veracity, Velocity and Variety. Along the same lines, the definition was presented as:
  • Internet of Content - YouTube, e-books, Wikipedia, news feeds
  • Internet of People - Email, Facebook, LinkedIn, etc.
  • Internet of Things - Devices with unique IDs communicating with / managing infrastructure
  • Internet of Location - Analysis of spatial data
This "Internet of *" framing is a good representation of the different forms and flows of information behind the four Vs.

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – Data science is an interdisciplinary field for extracting knowledge from data.

A data science workflow involves data visualization, data analysis, data processing and data storage tasks. Some of the tools used in each layer are listed below.


Tools available

Data Visualization
Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (Graph partitioning tool)

Data Analysis
Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR

Data Processing

Scheduler – Azkaban, Oozie, Ivory
Cluster Monitoring – (Ganglia + Nagios), Chukwa, ZooKeeper

Data Storage
HDFS, HSFTP (HDFS over HTTPS), S3, KFS (Kosmos File System)
Data Movement – Sqoop, Flume, Scribe, Kafka, MessageQueue
Columnar Storage – Zebra
Key Value – HBase

The key ingredients of Data Science are:
  • Data Management System
  • Data Mining
      ◦ A computational process to identify patterns in large data sets
      ◦ Uses techniques at the intersection of multiple disciplines (AI, statistics, machine learning, computer networks)
      ◦ Data classification, clustering, regression, association rule finding and anomaly detection
  • Process Mining
      ◦ Aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)
  • Information Visualization
      ◦ Visualization techniques for large data sets, interactive information visualization, and how to really visualize big data
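
To make one of the data mining tasks above a little more concrete, here is a minimal sketch of association rule finding in plain Python. This is my own illustration, not something from the session slides: the toy transactions and the helper names `support` and `confidence` are made up for the example, and real tools such as Mahout work at a very different scale.

```python
# Hypothetical toy "shopping basket" transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

# Rule under test: "people who buy bread also buy milk".
bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_conf)
```

A rule is usually considered interesting only when both numbers clear some chosen thresholds; picking those thresholds is a judgment call that depends on the data.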


Databases Vs Data Science
Aspect | Databases | Data Science
Data Value | Precious | Cheap
Data Volume | Modest | Massive
Structured | Strongly (schema) | Weakly or none (text)
Priorities | Consistency, error recovery, auditability | Speed, availability, query richness
Base | Relational algebra | Linear algebra
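
The "Base" contrast in the comparison above can be sketched in a few lines of plain Python. This is only my own rough illustration with invented data: the rows, the feature layout and the `weights` vector are all hypothetical, and neither half is how a real database or analytics library is implemented.

```python
# Hypothetical toy data, invented for illustration.
rows = [
    {"user": "a", "country": "IN", "spend": 10.0},
    {"user": "b", "country": "US", "spend": 20.0},
    {"user": "c", "country": "IN", "spend": 30.0},
]

# Relational algebra style: selection (filter rows), then projection (keep columns).
selected = [r for r in rows if r["country"] == "IN"]
projected = [(r["user"], r["spend"]) for r in selected]

# Linear algebra style: view the same data as a matrix and multiply it
# by a weight vector (a matrix-vector product written out by hand).
features = [[1.0, r["spend"]] for r in rows]  # each row is [bias, spend]
weights = [0.5, 0.1]                          # hypothetical model weights
scores = [sum(x * w for x, w in zip(row, weights)) for row in features]

print(projected)
print(scores)
```

The first half answers "which records match?", while the second half turns every record into a number, which is closer to what most machine learning methods do under the hood.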

PS: My professor provided references for the examples; this post is based on my notes and the slides from the session.

Happy Learning!!!