Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): World of Data Science

February 01, 2016

World of Data Science

My second-semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read and re-read many similar definitions and introductory articles on data science, but Prof. Manish Singh's session offered a fresh analogy and interesting examples to relate to.

For big data I have always referred back to the four Vs: Volume, Veracity, Velocity and Variety. Along the same lines, the session presented the definition as four "Internets":

Internet of Content - YouTube, e-books, Wikipedia, news feeds
Internet of People - Email, Facebook, LinkedIn etc.
Internet of Things - Devices with unique IDs communicating with / managing infrastructure
Internet of Location - Analysis of spatial data

This "Internet of *" view is a good representation of the different forms and flows of information behind the four Vs.

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – Data science is an interdisciplinary field that extracts knowledge from data.

The data science workflow involves data visualization, data analysis, data processing and data storage tasks. Some of the tools used in each layer are listed below.

Layer	Tools available
Data Visualization	Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (graph partitioning tool)
Data Analysis	Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR
Data Processing	Scheduler: Azkaban, Oozie, Ivory; Cluster Monitoring: Ganglia + Nagios, Chukwa, ZooKeeper
Data Storage	HDFS, HSFTP (HDFS over HTTP), S3, KFS (Kosmos File System); Data Movement: Sqoop, Flume, Scribe, Kafka, MessageQueue; Columnar Storage: Zebra; Key-Value: HBase

The key ingredients of Data Science are

· Data Management System

· Data Mining

    · Computational process to identify patterns in large data sets

    · Uses techniques at the intersection of multiple disciplines (AI, statistics, machine learning, computer networks)

    · Data classification, clustering, regression, association rule finding and anomaly detection

· Process Mining

    · Aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)

· Information Visualization

    · Visualization techniques for large data sets, interactive information visualization, how to really visualize big data
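
To make one of the data mining tasks above concrete, here is a minimal sketch of association rule finding in plain Python. It is my own illustration, not something from the session: the baskets are invented, and the helper names (support, confidence, frequent_pairs) are hypothetical. It only computes the two standard measures, support and confidence, by brute force; real tools use smarter algorithms to cope with large data sets.

```python
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def frequent_pairs(baskets, min_support):
    """Brute-force list of item pairs whose support is at least min_support."""
    items = sorted(set().union(*baskets))
    return [
        pair
        for pair in combinations(items, 2)
        if support(pair, baskets) >= min_support
    ]


if __name__ == "__main__":
    print(frequent_pairs(transactions, 0.5))
    print(confidence({"bread"}, {"milk"}, transactions))
```

With these baskets, the rule "bread => milk" has support 0.5 (two of four baskets) and confidence 2/3 (two of the three baskets that contain bread).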

Databases vs. Data Science

Aspect	Databases	Data Science
Data Value	Precious	Cheap
Data Volume	Modest	Massive
Structured	Strongly (Schema)	Weakly or none (text)
Priorities	Consistency, Error Recovery, Auditability	Speed, Availability, Query richness
Base	Relational Algebra	Linear Algebra
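
The last row of the table can be illustrated with a small sketch, again my own and not from the slides. The first half shows the relational algebra style (select, project and join over rows), and the second half shows the linear algebra style (a matrix-vector product over numbers). The tables, names and weights are invented, and the helper functions are hypothetical stand-ins for what a database engine or a numeric library would do.

```python
# Relational algebra style: operators take sets of rows and return sets of rows.
# The rows below are invented for illustration.
employees = [
    {"name": "Asha", "dept_id": 1},
    {"name": "Ravi", "dept_id": 2},
    {"name": "Meena", "dept_id": 1},
]
departments = [
    {"dept_id": 1, "dept": "Analytics"},
    {"dept_id": 2, "dept": "Storage"},
]


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns of each row."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, key):
    """Equi-join: merge every pair of rows that agree on key (nested loop)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


analytics_names = project(
    select(join(employees, departments, "dept_id"),
           lambda row: row["dept"] == "Analytics"),
    ["name"],
)


# Linear algebra style: data is a matrix of numbers, operators are products.
def matvec(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


features = [[1.0, 2.0], [3.0, 4.0]]  # one row per record, one column per feature
weights = [0.5, 0.25]                # hypothetical model weights
scores = matvec(features, weights)

if __name__ == "__main__":
    print(analytics_names)
    print(scores)
```

The relational half answers "which rows match?", while the linear algebra half turns each record into a number, which is the kind of computation most analysis and machine learning methods are built on.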

PS: My professor provided references for the examples; I am sharing this post based on my notes and the slides from the session.

Happy Learning!!!
