Happy Data Thinking!!!
Showing posts with label Big Data Setup. Show all posts
July 04, 2019
June 08, 2019
Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV
Finally setting up all the Big Data tools on my Linux setup. Rough steps and my reference notes.
Happy Data Thinking!!!!
Labels: Big Data, Big Data Setup, Data Science, Data Science Tips, Data Story, Kafka
April 02, 2019
Day #232 - Kafka + Spark Integration - Big Data Setup - Part I
Experimenting with Kafka and Spark using Pyspark
Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume
Happy Learning!!!
Labels: Big Data, Big Data Setup, Data Science, Data Science Tips, Pyspark
March 31, 2019
Day #230 - What is your Data Story? - Big Data Setup - Part III
Tools evolve, and patterns and architecture vary as we progress. To build your data story, you need to think from certain perspectives.
The story of Data in Motion
- Streaming data (Incoming data)
- Passing data (Data within a certain time interval)
- Transactional data (Current data in operation)
- Historical data (Transaction completed)
- Streaming data (Placing orders) - Kafka
- Passing data (Checking orders received in the past 30-minute window) – Spark
- Transactional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
- Completed data - Move completed orders to HDFS and build a Hive table for further analysis.
- Real-time Machine Learning on Spark (30-minute interval data: clustering sales orders into similar groups, clustering orders based on sellers and products, etc.) to understand the segmentation of data in that window interval
- Perform Machine Learning on the historical completed data (Recommendations, Forecasts, Predictions, etc.)
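The split between "passing" data (the last 30 minutes) and completed data can be sketched in plain Python. This is only an illustration under assumptions: the order fields (`id`, `placed_at`, `status`) and the helper names are hypothetical, and in a real setup the window would be computed by Spark over a Kafka stream rather than over an in-memory list.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # the 30-minute interval from the list above


def orders_in_window(orders, now, window=WINDOW):
    """Return orders placed in the interval (now - window, now]."""
    start = now - window
    return [o for o in orders if start < o["placed_at"] <= now]


def split_by_status(orders):
    """Separate in-flight orders from completed ones (archive candidates)."""
    active = [o for o in orders if o["status"] != "completed"]
    completed = [o for o in orders if o["status"] == "completed"]
    return active, completed


# Hypothetical sample data
now = datetime(2019, 3, 31, 12, 0)
orders = [
    {"id": 1, "placed_at": now - timedelta(minutes=5), "status": "placed"},
    {"id": 2, "placed_at": now - timedelta(minutes=45), "status": "completed"},
    {"id": 3, "placed_at": now - timedelta(minutes=29), "status": "completed"},
]

recent = orders_in_window(orders, now)       # "passing" data
active, completed = split_by_status(orders)  # transactional vs historical
print([o["id"] for o in recent])
```

The same two questions (what arrived in the window, what is finished) are what decide which store each record belongs in.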
Spark
- Spark Streaming – Real-time querying. RDD data is loaded into RAM and kept there until you are done; data is cached in RAM from disk for iterative processing.
- RDD (Resilient Distributed Dataset) – a read-only collection of objects partitioned across machines
- Spark SQL – Schema / SQL
- Immutable data is always safe to share across multiple processes as well as multiple threads
- Machine Learning – MLlib
- Graph Processing – GraphX
Kafka
- At the heart of Apache Kafka sits a distributed log
- The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
- When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
- Data is immutable. When you read messages from Kafka, data is copied directly from the disk buffer to the network buffer
- Data is organized in topics. Producers write data to brokers, and consumers read data from brokers
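The log idea described above can be sketched as a toy in plain Python. This is not the Kafka client API: `TopicLog`, `append` and `poll` are hypothetical names, and a real broker adds partitions, replication and persistence. The sketch only shows the append-only log and the per-consumer read position.

```python
class TopicLog:
    """Toy append-only log: messages are appended, never modified."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, message):
        """Producer side: append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Consumer side: seek to the last position, read sequentially,
        then record the new position."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for message in ["order-1", "order-2", "order-3"]:
    log.append(message)

first = log.poll("billing", max_messages=2)  # reads from offset 0
second = log.poll("billing")                 # continues where it left off
replay = log.poll("analytics")               # a new consumer starts at 0
```

Because the log itself never changes, each consumer only needs to remember a single number, and a new consumer can replay everything from the start.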
HBase
- Low latency, consistent, best suited for random read/write access to big data
- NoSQL database hosted on top of HDFS. Column-oriented (column-family) database
- HBase uses HDFS as its data storage layer, which takes care of fault tolerance and scalability
Hive
- Targeted at analytics
- A natural choice for SQL developers
- ETL + DW (data summarization, query and analysis)
Pig
- Scripting language for analytics queries
Considerations for RDBMS vs NoSQL
- Performance - Latency tolerance: how slowly can my queries run on huge data sets?
- Durability - Data loss tolerance: how much in-memory data or how many transactions can be lost when the database crashes?
- Consistency - Weird results tolerance (dirty data tolerance)
- Availability - Downtime tolerance
The lambda architecture is the reference for the speed (fast) and batch layers. With machine learning and more tools evolving, it helps to think of your Data Story from an end-to-end perspective and fit in tools for your needs.
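A minimal sketch of the batch layer plus speed layer idea, in plain Python. The function names and the per-product order count are hypothetical choices for illustration. In practice the batch view would be recomputed by a batch job over historical data and the speed view maintained by a stream processor.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer: recompute the view from all historical events."""
    return Counter(event["product"] for event in historical_events)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally update the view for a recent event."""
    speed_view[event["product"]] += 1


def query(product, batch_view, speed_view):
    """Serving: merge the batch view with the recent (speed) view."""
    return batch_view[product] + speed_view[product]


# Hypothetical sample events
historical = [{"product": "book"}, {"product": "book"}, {"product": "pen"}]
batch_view = build_batch_view(historical)

speed_view = Counter()
for event in [{"product": "book"}, {"product": "lamp"}]:
    update_speed_view(speed_view, event)

print(query("book", batch_view, speed_view))
```

When the next batch run absorbs the recent events, the speed view is reset, so the same query keeps returning a complete answer.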
The tools remain the same, but the mapping differs across cloud providers.
Data is the same, but we have progressed to querying data in motion. Tools evolve, but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.
Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)
My Whiteboard
More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data
Update - Oct 18th 2020
The new wave of current products and cloud providers is impressive. Reusing from link, Post
BI Architecture
Data Processing
AI Architecture
Time to write your own Data Story!!!
Labels: Big Data, Big Data Setup, Data Science, Data Science Tips, Kafka
October 05, 2018
Deep Dive PySpark Examples - Big Data Setup - Part II
After experimenting a bit with PySpark, I feel it is much easier to handle this work with R / Python. Most of the things we can achieve are repeated across R / Python / Spark / SQL.
- Data pipeline tasks at the DB level
- One-hot encoding can also be done with basic T-SQL code
- While working in NLP, it makes sense to use TF-IDF vectorizers
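To show how small these two steps are, here is a sketch of one-hot encoding and a basic TF-IDF in plain Python. The helper names are hypothetical, and the TF-IDF variant is the simple unsmoothed one (term frequency times log(N / document frequency)); library vectorizers typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows


def tf_idf(docs):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / doc frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


categories, encoded = one_hot(["red", "green", "red"])
scores = tf_idf(["spark is fast", "kafka is a log"])
```

A term that appears in every document (here "is") scores zero, which is the point of the IDF factor: it downweights words that do not distinguish documents.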
Happy Learning!!!
Labels: Big Data, Big Data Setup, Data Science, Data Science Tips, Pyspark
September 28, 2014
Spark Overview
I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked into it further. The session by Jyotiska NK on Python + Spark: Lightning Fast Cluster Computing was a useful starter on Spark. (slides of talk)
Spark
- In-memory cluster computing framework for large-scale data processing
- Developed in Scala, with Java and Python APIs
- It is not meant to replace Hadoop; it can sit on top of Hadoop
- Spark Summit slides and videos from past events are good references to learn from - link
Python Offerings
- PySpark, Data pipeline using spark
- Spark for real time / batch processing
Spark vs MapReduce Differences
This section was the highlight of the session. The way data is handled in MapReduce execution versus the Spark approach is the key difference.
MapReduce Approach - Load data from disk into RAM; Mapper, Shuffler and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data.
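The three stages can be sketched on a single machine with a word count in plain Python. This is an illustration, not Hadoop code: the function names are hypothetical, and a real job runs mappers and reducers on different nodes with the shuffle moving data between them.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle stage: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce stage: combine the values for one key."""
    return key, sum(values)


lines = ["spark and hadoop", "spark on hadoop", "kafka"]
mapped = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(mapped)
counts = dict(reducer(key, values) for key, values in grouped.items())
print(counts)
```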
Spark - Load data into RAM and keep it until you are done. Data is cached in RAM from disk for iterative processing; if the data is too large, the rest is spilled to disk. This enables interactive processing of datasets without having to reload the data from disk for every query. The core abstraction is the RDD (Resilient Distributed Dataset).
RDD - Read-only collection of objects partitioned across machines. If part of it is lost, it can be recomputed from the operations that produced it.
RDD Operations
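- Transformations - map, filter, sort, flatMap. Transformations are lazy: they describe a new RDD without computing it
- Actions - reduce, count, collect, save to disk. An action triggers the actual computation and returns a result or writes it to storage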
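The difference between transformations and actions, and the idea of recomputing from recorded operations, can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in and not the PySpark API: it runs on one machine and has no partitions, but it records transformations lazily and only computes when an action is called.

```python
import functools


class ToyRDD:
    """Read-only collection that records transformations as a lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)  # immutable input data
        self._lineage = lineage       # tuple of (kind, function) steps

    # Transformations: return a new ToyRDD, compute nothing yet
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the lineage over the source data."""
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet

result = evens_squared.collect()                  # action: computes now
total = evens_squared.reduce(lambda a, b: a + b)  # recomputed from lineage
```

Each action replays the lineage from the source, which is also why a lost result can be rebuilt; in Spark, caching avoids repeating that work when the same RDD is reused.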
More Reads
Testing Spark Best Practices
Gatling - Open Source Perf Test Framework
Spark Paper
Happy Learning!!!
Labels: Big Data, Big Data Setup, Pycon2014, Pyspark, Spark





