My second-semester classes have started. The first session was very interesting and a great introduction to
the world of data science. I have read and re-read the same kind of definitions and introductory
articles on data science, but Prof. Manish Singh's session offered a whole new analogy
and interesting examples to relate to.
For big data I have always referred back to the 4 Vs: Volume,
Veracity, Velocity and Variety. In a similar way, the session presented
the definition as four categories:
- Internet of Content - YouTube, e-books, Wikipedia, news feeds
- Internet of People - Email, Facebook, LinkedIn etc.
- Internet of Things - Devices with unique IDs communicating / managing infrastructure
- Internet of Location - Spatial data analysis
Big Data = Crude Oil
"Big data is about
extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it
through ‘pipelines’ and storing it in massive ‘silos’"
Data Science: Data science is an interdisciplinary field that extracts
knowledge from data.
A data science workflow involves data
visualization, data analysis, data processing and data storage tasks. Some of
the tools used in each layer are listed below.
Layer | Tools available |
Data Visualization | Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (graph partitioning tool) |
Data Analysis | Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR (MapReduce) |
Data Processing | Scheduler: Azkaban, Oozie, Ivory; Cluster Monitoring: Ganglia + Nagios, Chukwa, Zookeeper |
Data Storage | HDFS, HSFTP (HDFS over HTTPS), S3, KFS (Kosmos File System); Data Movement: Sqoop, Flume, Scribe, Kafka, MessageQueue; Columnar Storage: Zebra; Key Value: HBase |
The key ingredients of Data Science are:
- Data Management System
- Data Mining
  - Computational process to identify patterns in large data sets
  - Uses techniques at the intersection of multiple disciplines (AI, Stats, Machine Learning, Computer Networks)
  - Data classification, clustering, regression, association rule finding and anomaly detection
- Process Mining
  - Aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)
- Information Visualization
  - Visualization techniques for large data sets, interactive information visualization, how to really visualize big data
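To make one of the data mining tasks listed above a little more concrete, here is a minimal sketch of anomaly detection using a simple z-score rule. This is my own illustration and not something from the session: the function name `find_anomalies`, the threshold of 2.0 and the response-time numbers are all made up for the example, and real systems use far more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (a basic z-score rule)."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds; the data is invented.
response_times = [102, 98, 101, 97, 103, 99, 100, 480]
print(find_anomalies(response_times))  # prints [480]
```

With such a tiny sample the outlier itself inflates the standard deviation, which is why the sketch uses a fairly low threshold.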
Aspect | Databases | Data Science |
Data Value | Precious | Cheap |
Data Volume | Modest | Massive |
Structure | Strongly structured (schema) | Weakly structured or none (text) |
Priorities | Consistency, Error Recovery, Auditability | Speed, Availability, Query richness |
Base | Relational algebra | Linear algebra |
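The last row of the comparison (relational algebra versus linear algebra) can be sketched in a few lines of plain Python. This is only my own toy illustration under made-up assumptions: the `employees` rows, the `term_counts` matrix, the `weights` vector and the helper names `select`, `project` and `mat_vec` are all hypothetical and do not come from the session.

```python
# Relational algebra view: rows with a fixed schema, queried by
# selection (filter rows) and projection (keep some columns).
employees = [
    {"name": "Asha", "dept": "Sales", "salary": 50},
    {"name": "Ravi", "dept": "Eng", "salary": 70},
    {"name": "Meena", "dept": "Eng", "salary": 80},
]


def select(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


eng_names = project(select(employees, lambda r: r["dept"] == "Eng"), ["name"])
print(eng_names)  # prints [{'name': 'Ravi'}, {'name': 'Meena'}]


# Linear algebra view: the data is a matrix of numbers and the
# analysis is matrix arithmetic.
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Rows are documents, columns are counts of two words; invented data.
term_counts = [[3, 0], [1, 2], [0, 4]]
weights = [1.0, 0.5]
scores = mat_vec(term_counts, weights)
print(scores)  # prints [3.0, 2.0, 2.0]
```

The first half answers an exact question about structured records, while the second half turns loosely structured data into numbers and scores it.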
PS: My professor provided references for the examples; I
am sharing this post based on my notes and the slides from the session.
Happy Learning!!!