Tools evolve, and patterns and architectures vary as we progress. To build your data story, you need to think from a few perspectives.
The story of Data in Motion
- Streaming data (Incoming data)
- Passing data (Data within a certain time interval)
- Transactional data (Current data in operation)
- Historical data (Transaction completed)
- Streaming data (Placing orders) - Kafka
- Passing data (Checking orders received in the past 30-minute window) – Spark
- Transactional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
- Historical data (Completed orders) - Move completed orders to HDFS and build a Hive table for further analysis, as sketched below
- Real-time Machine Learning on Spark (30-minute interval data; clustering sales orders into similar groups, clustering orders by sellers and products, etc.) to understand the segmentation of data at that window interval
- Perform Machine Learning on the historical completed data (recommendations, forecasts, predictions, etc.)
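To make the flow concrete, here is a minimal sketch of the order pipeline above in PySpark Structured Streaming. The broker address, topic name, schema, and HDFS paths are illustrative assumptions (not from the original post), and the Kafka source needs the spark-sql-kafka package on the classpath.

```python
# Sketch: orders stream in from Kafka (streaming data), are counted over
# a 30-minute window (passing data), and completed orders land on HDFS
# (historical data). Broker, topic, schema, and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("OrdersInMotion").getOrCreate()

order_schema = (StructType()
    .add("order_id", StringType())
    .add("seller", StringType())
    .add("amount", DoubleType())
    .add("status", StringType())
    .add("order_time", TimestampType()))

# Streaming data: orders arriving on a Kafka topic
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load())

orders = (raw
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.*"))

# Passing data: orders per seller in the past 30-minute window
windowed = (orders
    .withWatermark("order_time", "30 minutes")
    .groupBy(window(col("order_time"), "30 minutes"), col("seller"))
    .count())
window_query = (windowed.writeStream
    .outputMode("append")
    .format("console")
    .start())

# Historical data: completed orders move to HDFS for Hive analysis
completed_query = (orders.filter(col("status") == "completed")
    .writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "hdfs:///data/orders/completed")
    .option("checkpointLocation", "hdfs:///checkpoints/orders")
    .start())

spark.streams.awaitAnyTermination()
```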
Spark
- Spark Streaming – Real-time querying; load RDD data into RAM and keep it there until you are done. Data is cached in RAM from disk for iterative processing (see the sketch after this list)
- RDD (Resilient Distributed Dataset) – a read-only collection of objects partitioned across machines
- Spark SQL – Schema / SQL
- Immutable data is always safe to share across multiple processes as well as multiple threads
- Machine Learning – MLlib
- Graph Processing – GraphX
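A minimal sketch of that in-RAM caching for iterative processing, assuming a local Spark context and an illustrative HDFS path and record layout:

```python
# Sketch: cache an RDD in RAM once, then reuse it across iterations.
# The input path and record layout are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-demo")

# RDD: a read-only collection of records partitioned across machines
orders = sc.textFile("hdfs:///data/orders/raw").map(lambda line: line.split(","))
orders.cache()  # keep the RDD in RAM until you are done with it

# Each pass reads from the in-memory cache instead of going back to disk
for _ in range(3):
    completed = orders.filter(lambda fields: fields[1] == "completed").count()
    print(completed)
```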
Kafka
- At the heart of Apache Kafka sits a distributed log
- The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
- When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log (sketched after this list)
- Data is immutable. When you read messages from Kafka, data is copied directly from the disk buffer to the network buffer (zero-copy)
- Data is organized in topics. Producers write data to brokers; consumers read data from brokers
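Here is a minimal sketch of that seek-then-scan pattern using the kafka-python client; the broker address, topic name, and starting offset are illustrative assumptions.

```python
# Sketch: seek to the last-read position in the log, then scan forward
# in order, periodically committing the new position. Names are assumptions.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)

# Pin the consumer to one partition of the distributed log
tp = TopicPartition("orders", 0)
consumer.assign([tp])

# 'Seek' to just past the last message read (offset stored elsewhere)
last_read_offset = 41  # illustrative
consumer.seek(tp, last_read_offset + 1)

# Scan sequentially, reading messages in order
for message in consumer:
    print(message.offset, message.value)
    consumer.commit()  # periodically record the new position in the log
```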
HBase
- Low latency, consistent; best suited for random read/write access to big data
- NoSQL database hosted on top of HDFS; a column-oriented database
- HBase uses HDFS as its data storage layer, which takes care of the fault tolerance and scalability aspects (see the sketch below)
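A minimal sketch of random read/write access through the happybase Python client; it assumes an HBase Thrift server running on localhost and an illustrative `orders` table with a `cf` column family.

```python
# Sketch: low-latency random write and read by row key in HBase.
# Host, table, and column names are illustrative assumptions, and an
# HBase Thrift server must be running for happybase to connect.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("orders")

# Random write: row key -> {column-family:qualifier: value}
table.put(b"order#1001", {b"cf:status": b"placed", b"cf:amount": b"49.99"})

# Random read by row key
row = table.row(b"order#1001")
print(row[b"cf:status"])
```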
Hive
- Targeted at analytics
- The natural choice for SQL developers is Hive
- ETL + DW (data summarization, query, and analysis); see the sketch below
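A minimal sketch of the analytics side: building a Hive table over the completed orders on HDFS and running a summarization query, here driven through a SparkSession with Hive support. Paths and table names are illustrative assumptions carried over from the earlier sketches.

```python
# Sketch: define a Hive table over completed orders on HDFS, then run
# a DW-style summarization query. Paths and names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("orders-analytics")
    .enableHiveSupport()
    .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS completed_orders (
        order_id STRING, seller STRING, amount DOUBLE,
        status STRING, order_time TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///data/orders/completed'
""")

# Data summarization, query, and analysis over the historical layer
spark.sql("""
    SELECT seller, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM completed_orders
    GROUP BY seller
    ORDER BY revenue DESC
""").show()
```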
Pig
- Scripting language for analytics queries
Considerations for RDBMS vs. NoSQL
- Performance - Latency tolerance: how slow can queries be allowed to run over huge data sets?
- Durability - Data-loss tolerance: can you afford to lose in-memory data or in-flight transactions when the database crashes?
- Consistency - Tolerance for anomalous results (dirty reads)
- Availability - Downtime tolerance
The Lambda architecture is the reference for the speed and batch layers. With machine learning and more tools evolving, it helps to think of the data story from an end-to-end perspective and fit in tools for your needs.
The tools remain the same, but the mapping differs across cloud providers.
The data is the same, but we have progressed further to querying data in motion. Tools evolve, but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.
Old Process – Model – Collect – Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)
My Whiteboard
More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data
Update - Oct 18th 2020
The new wave of products from cloud providers is impressive. Reusing from the linked post.
BI Architecture
Data Processing
AI Architecture
Time to write your own Data Story!!!