"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, and patterns and architectures vary as we progress. To build your data story you need to think from a few perspectives.

The story of Data in Motion
  • Streaming data (Incoming data)
  • Passing data (Data within a certain time window)
  • Transactional data (Current data in operation)
  • Historical data (Completed transactions)
The most famous example is the e-commerce segment. The data story evolves as
  • Streaming data (Placing orders) - Kafka
  • Passing data (Checking orders received in the past 30-minute window) – Spark (see the streaming sketch after this list)
  • Transactional data (Orders placed) - HBase / NoSQL / any RDBMS, based on business need
  • Historical data (Completed orders) - Move completed orders to HDFS and build Hive tables for further analysis
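As a rough illustration of the "passing data" step, here is a minimal PySpark Structured Streaming sketch that reads order events from Kafka and counts them in 30-minute windows. The broker address, the topic name ("orders") and the event schema are assumptions, and the spark-sql-kafka connector package needs to be on the classpath.

```python
# Minimal sketch: count orders from Kafka in 30-minute windows with Spark.
# Broker, topic and schema below are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col, from_json
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("order-window-demo").getOrCreate()

order_schema = (StructType()
                .add("order_id", StringType())
                .add("amount", DoubleType())
                .add("event_time", TimestampType()))

orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")            # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), order_schema).alias("o"))
          .select("o.*"))

# Orders received in the past 30-minute window
windowed_counts = (orders
                   .withWatermark("event_time", "30 minutes")
                   .groupBy(window(col("event_time"), "30 minutes"))
                   .count())

query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```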
Data Science Role in the Story
  • Real-Time Machine Learning on Spark (30-minute interval data: clustering sales orders to group them into similar clusters, clustering orders based on sellers and products, etc.) to understand the segmentation of data at that window interval (see the clustering sketch after this list)
  • Machine Learning on the Historical Completed Data (Recommendations, Forecasts, Predictions, etc.)
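A minimal sketch of the windowed clustering idea using Spark MLlib KMeans; the order columns (amount, item_count) and the sample rows are made up for illustration.

```python
# Minimal sketch: cluster orders from one 30-minute window by amount and
# item count using Spark MLlib KMeans. Columns and rows are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("order-clustering-demo").getOrCreate()

# Orders captured for a single window interval (stand-in for streamed data)
orders = spark.createDataFrame(
    [("o1", 120.0, 2), ("o2", 15.5, 1), ("o3", 480.0, 6), ("o4", 22.0, 1)],
    ["order_id", "amount", "item_count"])

features = VectorAssembler(inputCols=["amount", "item_count"],
                           outputCol="features").transform(orders)

model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)   # adds a "prediction" cluster column
clustered.show()
```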
The key tools summary
Spark
  • Spark Streaming – near-real-time processing; data is loaded from disk and cached in RAM, where it is kept for iterative processing until you are done
  • RDD (Resilient Distributed Dataset) – a read-only collection of objects partitioned across machines
  • Spark SQL – Schema / SQL
  • Immutable data is always safe to share across multiple processes as well as multiple threads (see the RDD caching sketch after this list)
  • Machine Learning – MLlib
  • Graph Processing – GraphX
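To make the RDD points concrete, a small sketch showing that RDDs are read-only and that cache() keeps a computed RDD in RAM so iterative actions reuse it instead of recomputing:

```python
# Minimal sketch: an RDD is a read-only, partitioned collection; cache()
# keeps it in memory so repeated actions reuse it rather than recompute it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-cache-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000), numSlices=8)   # spread across partitions
squares = numbers.map(lambda x: x * x).cache()            # cached in RAM after first action

# Transformations return new RDDs; the original is never modified (immutable).
total = squares.sum()                                      # materialises and caches the RDD
count_even = squares.filter(lambda x: x % 2 == 0).count()  # reuses the cached data
print(total, count_even)
```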
Kafka
  • At the heart of Apache Kafka sits a distributed log
  • The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
  • When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log (see the consumer sketch after this list)
  • Data is immutable. When you read messages from Kafka, data is copied directly from the disk buffer to the network buffer (zero-copy)
  • Data is organized in topics. Producers write data to brokers; consumers read data from brokers
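A minimal producer/consumer sketch of the seek-and-scan pattern using the kafka-python client; the broker address, topic and consumer group are assumptions.

```python
# Minimal sketch with kafka-python: a producer appends messages to a topic;
# a consumer seeks to an offset, scans forward in order, and commits its
# position as it goes. Broker, topic and group names are assumptions.
from kafka import KafkaProducer, KafkaConsumer, TopicPartition

producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("orders", value=f"order-{i}".encode("utf-8"))
producer.flush()

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="order-readers",
                         enable_auto_commit=False)
tp = TopicPartition("orders", 0)
consumer.assign([tp])
consumer.seek(tp, 0)                      # 'seek' to the position to read from

for record in consumer:                   # scan sequentially, in order
    print(record.offset, record.value)
    consumer.commit()                     # periodically record the new position
    if record.offset >= 4:
        break
consumer.close()
```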
HBASE
  • Low latency, consistent, best suited for random read/write access to big data (see the sketch after this list)
  • NoSQL database hosted on top of HDFS; column-oriented (column-family) database
  • HBase uses HDFS as its data storage layer, which takes care of the fault tolerance and scalability aspects
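A small sketch of random reads and writes by row key using the happybase Thrift client; the table name ("orders") and column family ("d") are assumptions, and the table is presumed to already exist.

```python
# Minimal sketch: random write and read by row key against HBase via Thrift.
# Table "orders" with column family "d" is assumed to exist already.
import happybase

connection = happybase.Connection("localhost")   # HBase Thrift server
table = connection.table("orders")               # hypothetical table

# Low-latency random write: one row keyed by order id
table.put(b"order-1001", {b"d:status": b"PLACED", b"d:amount": b"120.00"})

# Low-latency random read of the same row
row = table.row(b"order-1001")
print(row[b"d:status"], row[b"d:amount"])

connection.close()
```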
Hive
  • Targeted for Analytics
  • The natural choice for SQL developers
  • ETL + DW (data summarization, query and analysis) – see the query sketch after this list
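A sketch of the ETL + DW style usage, querying a Hive table of completed orders from Spark SQL with Hive support enabled; the table name (completed_orders) and its columns are assumptions.

```python
# Minimal sketch: summarize a Hive table of completed orders from Spark SQL.
# Table and column names are assumptions; requires Hive support enabled.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-analytics-demo")
         .enableHiveSupport()
         .getOrCreate())

# Typical data summarization over historical data stored in Hive
summary = spark.sql("""
    SELECT seller_id,
           COUNT(*)    AS completed_orders,
           SUM(amount) AS total_sales
    FROM completed_orders
    GROUP BY seller_id
    ORDER BY total_sales DESC
""")
summary.show()
```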
Pig
  • Scripting language for analytics queries
Considerations for RDBMS vs NoSQL
  • Performance - Latency tolerance: how slow can my queries run for huge data sets?
  • Durability - Tolerance for data loss when the database crashes losing in-memory state, or for lost transactions
  • Consistency - Tolerance for weird results (dirty data)
  • Availability - Downtime tolerance
The Lambda architecture is the reference for the fast (speed) and batch layers. With machine learning and more tools evolving, it helps to think of the data story from an end-to-end perspective and fit in tools for your needs.


The tools remain the same, but the mapping differs across cloud providers.



Data is the same, but we have progressed further to query data in motion. Tools evolve, but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.

Old Process – Model – Collect – Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard


More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of products from the current cloud providers is impressive. Reusing from link, Post

Architecture



BI Architecture



Data Processing



AI Architecture


Time to write your own Data Story!!!
