"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Big Data Setup. Show all posts

June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally setting up all the Big Data tools on my Linux machine. Rough steps and my reference notes.




Happy Data Thinking!!!!

April 02, 2019

Day #232 - Kafka + Spark Integration - Big Data Setup - Part I

Experimenting with Kafka and Spark using Pyspark

Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume

Happy Learning!!!

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, patterns and architecture vary as we progress. To build your data story you need to think in certain perspectives

The story of Data in Motion
  • Streaming data (Incoming data)
  • Passing data (Data within a certain time interval)
  • Transactional data (Current data in operation)
  • Historical data (Transaction completed)
The most famous example is the e-commerce segment. The data story evolves as
  • Streaming data (Placing orders) - Kafka
  • Passing data (Checking orders received in the past 30-minute window) – Spark
  • Transactional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
  • Historical data (Completed orders) - Move completed orders to HDFS and build Hive tables for further analysis.
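The "passing data" step above can be sketched in plain Python. This is only an illustration of the windowing idea, not Spark code: the function name `orders_in_window`, the `placed_at` field, and the sample orders are all made up for the example.

```python
from datetime import datetime, timedelta

def orders_in_window(orders, now, window=timedelta(minutes=30)):
    """Return the orders placed in the interval (now - window, now]."""
    start = now - window
    return [o for o in orders if start < o["placed_at"] <= now]

# Hypothetical sample data: three orders placed before noon.
now = datetime(2019, 3, 31, 12, 0)
orders = [
    {"id": 1, "placed_at": datetime(2019, 3, 31, 11, 20)},  # 40 minutes old
    {"id": 2, "placed_at": datetime(2019, 3, 31, 11, 45)},  # 15 minutes old
    {"id": 3, "placed_at": datetime(2019, 3, 31, 11, 59)},  # 1 minute old
]

recent = orders_in_window(orders, now)
print([o["id"] for o in recent])
```

In a real pipeline the same filter would run continuously over the stream; here it runs once over a list.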
Data Science Role in the Story
  • Real-time Machine Learning on Spark (30-minute interval data, clustering sales orders into similar groups, clustering orders based on sellers and products, etc.) to understand the segmentation of data in that window
  • Perform Machine Learning on the Historical Completed Data (Recommendations, Forecast, Predictions etc.)
The key tools summary
Spark
  • Spark Streaming – Real-time querying, Load RDD data in RAM, keep it until you are done, Data is cached in RAM from disk for iterative processing. 
  • RDD (Resilient distributed datasets). RDD - Read Only collection of objects across machines
  • Spark SQL – Schema / SQL
  • Immutable data is always safe to share across multiple processes as well as multiple threads
  • Machine Learning – MLlib
  • Graph Processing – GraphX
Kafka
  • At the heart of Apache Kafka sits a distributed log
  • The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
  • When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
  • Data is immutable. When you read messages from Kafka, data is copied directly from the disk buffer to the network buffer (zero copy)
  • Data organized in topics. Producers write data to brokers, Consumers read data from brokers
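The log idea described above (append sequentially, let each reader remember its own position) can be sketched as a toy in-memory class. This is not the Kafka API: `TopicLog`, `append`, and `poll` are hypothetical names chosen for the illustration, and there are no brokers, partitions, or disks involved.

```python
class TopicLog:
    """Toy stand-in for a single append-only log (not a real Kafka client)."""

    def __init__(self):
        self._messages = []  # messages are only ever appended
        self._offsets = {}   # consumer name -> next offset to read

    def append(self, message):
        """Producer side: add a message at the end and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Consumer side: read sequentially from the last recorded position."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)  # record the new position
        return batch

log = TopicLog()
for order in ["order-1", "order-2", "order-3"]:
    log.append(order)

first = log.poll("billing", max_messages=2)  # reads from offset 0
second = log.poll("billing")                 # continues where it left off
audit = log.poll("audit")                    # independent reader, starts at 0
print(first, second, audit)
```

Because readers only track an offset and never modify the log, any number of consumers can read the same messages independently.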
HBase
  • Low latency, consistent, best suited for random read/write access to big data
  • NoSQL database hosted on top of HDFS. Column-oriented database
  • HBase uses HDFS as its data storage layer, which takes care of fault tolerance and scalability
Hive
  • Targeted for Analytics
  • Hive is the natural choice for SQL developers
  • ETL + DW (data summarization, query and analysis)
Pig
  • Scripting language for analytics queries
Considerations for RDBMS vs NoSQL
  • Performance - Latency tolerance, how slow my queries can be on huge data sets
  • Durability - Data loss tolerance: how many in-memory changes or transactions can be lost when the database crashes
  • Consistency - Weird results tolerance (Dirty data tolerance)
  • Availability - Downtime tolerance
The lambda architecture is the reference for the fast and batch layers. With machine learning and more tools evolving, it helps to think of the Data Story from an end-to-end perspective and fit in the tools you need.


The tools remain the same but mapping is different across different cloud providers 



Data is the same, but we have progressed further to query data in motion. Tools evolve but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.

Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard


More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of current products and cloud providers is impressive. Reusing from link, Post

Architecture



BI Architecture



Data Processing



AI Architecture


Time to write your own Data Story!!!

October 05, 2018

Deep Dive PySpark Examples - Big Data Setup - Part II

After experimenting a bit with PySpark, I feel it is much easier to handle alongside R / Python. Most of the things we can achieve are repeated across R / Python / Spark / SQL.

  • Data Pipeline tasks at DB Level
  • One-hot encoding can also be done with basic T-SQL code
  • While working in NLP it makes sense to use TF-IDF Vectorizers
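To show why one-hot encoding is tool-agnostic, here is a minimal plain-Python sketch of the logic. The function name `one_hot` and the colour values are made up for the example; in SQL the same result would come from one indicator column per category.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows

# Hypothetical column of categorical values.
categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
for row in rows:
    print(row)
```

Each row has exactly one 1, in the column of its category.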

Happy Learning!!!


September 28, 2014

Spark Overview


I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked into Spark further. The session by Jyotiska NK on Python + Spark: Lightning Fast Cluster Computing was a useful starter on Spark. (slides of talk)

Spark 
  • In memory cluster computing framework for large scale data processing
  • Developed in Scala, with Java + Python APIs
  • This is not meant to replace Hadoop. It can sit on top of Hadoop
  • References on Spark Summit for Slides / Videos to learn from past events - link 
Python Offerings
  • PySpark, Data pipeline using spark
  • Spark for real time / batch processing
Spark Vs Map Reduce Differences
This section was the highlight of the session. The way data is handled in MapReduce execution versus the Spark approach is the key difference.

Map Reduce Approach - Load data from disk into RAM; Mapper, Shuffler, and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data
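The three stages can be sketched on a single machine with a word count. This is a plain-Python illustration of the flow, not Hadoop code; the function names and the sample lines are made up for the example.

```python
from collections import defaultdict

def mapper(line):
    """Map stage: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle stage: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce stage: combine the grouped values for one key."""
    return key, sum(values)

# Hypothetical input: each string stands for one line of a file.
lines = ["spark kafka spark", "kafka hive"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data between them over the network.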

Spark - Load data in RAM and keep it until you are done. Data is cached in RAM from disk for iterative processing. If the data is too large, the rest is spilled to disk. Interactive processing of datasets without having to reload the data for each query. RDD (Resilient distributed datasets)

RDD - Read Only collection of objects across machines. On losing information this can still be recomputed. 

RDD Operations
  • Transformations - map, filter, sort, flatMap
  • Actions - reduce, count, collect, save data to local disk. An action triggers the actual computation and returns a result or writes output
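The split between transformations and actions can be sketched with a toy class. This is not the PySpark API: `ToyRDD` is a hypothetical single-machine stand-in that only records transformations and runs them when an action is called, which is also why lost results can be recomputed from the source.

```python
class ToyRDD:
    """Toy read-only collection with lazy transformations (not real Spark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)  # never modified after creation
        self._steps = steps          # recorded transformations (the lineage)

    def map(self, fn):
        """Transformation: record the step and return a new ToyRDD."""
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        """Transformation: record the step and return a new ToyRDD."""
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        """Replay the recorded steps over the source data."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        """Action: run the pipeline and return all results."""
        return list(self._compute())

    def count(self):
        """Action: run the pipeline and return the number of results."""
        return sum(1 for _ in self._compute())

calls = []

def double(x):
    calls.append(x)  # track when the function really runs
    return x * 2

rdd = ToyRDD([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
ran_before_action = len(calls)  # nothing has executed yet
result = rdd.collect()          # the action triggers the work
print(ran_before_action, result)
```

Calling `rdd.count()` afterwards replays the same steps from the source, since this toy version caches nothing.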

More Reads
Testing Spark Best Practices
Gatling - Open Source Perf Test Framework
Spark Paper

Happy Learning!!!