Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Big Data Setup

July 04, 2019

Day #261 - Setting up Spark on my Ubuntu - Big Data Setup - Part V

Happy Data Thinking!!!

June 08, 2019

Day #259 - Setting up Kafka on my Ubuntu - Big Data Setup - Part IV

Finally setting up all the Big Data tools on my Linux machine. Rough steps and my reference notes.

Happy Data Thinking!!!!

April 02, 2019

Day #232 - Kafka + Spark Integration - Big Data Setup - Part I

Experimenting with Kafka and Spark using PySpark

Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume

Happy Learning!!!

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, and patterns and architecture vary as we progress. To build your data story, you need to look at your data from a few perspectives.

The story of Data in Motion

Streaming data (Incoming data)
Passing data (Data within a certain time interval)
Transactional data (Current data in operation)
Historical data (Transaction completed)

The best-known example is the e-commerce segment. The data story evolves as:

Streaming data (placing orders) - Kafka
Passing data (checking orders received in the past 30-minute window) - Spark
Transactional data (orders placed) - HBase / NoSQL / any RDBMS, based on business need
Historical data (completed orders) - move to HDFS and build a Hive table for further analysis
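The "passing data" step in the story above can be sketched in plain Python, without Spark: keep only the orders whose timestamp falls inside the last 30 minutes. The Order fields, the sample data and the function name orders_in_window are assumptions made up for illustration; in a real pipeline Spark would apply the same idea to a stream.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Order:
    # Hypothetical fields, chosen only for this sketch.
    order_id: int
    seller: str
    amount: float
    placed_at: datetime


def orders_in_window(orders, now, minutes=30):
    """Return the orders placed within the last `minutes` minutes before `now`."""
    start = now - timedelta(minutes=minutes)
    return [o for o in orders if start < o.placed_at <= now]


now = datetime(2019, 3, 31, 12, 0)
orders = [
    Order(1, "seller_a", 250.0, now - timedelta(minutes=5)),
    Order(2, "seller_b", 99.0, now - timedelta(minutes=29)),
    Order(3, "seller_a", 40.0, now - timedelta(minutes=45)),  # outside the window
]

recent = orders_in_window(orders, now)
print([o.order_id for o in recent])  # [1, 2]
```

Orders that fall out of the window are the ones that would move on to the transactional or historical store.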

Data Science Role in the Story

Real-time machine learning on Spark (on the 30-minute interval data: clustering sales orders into similar groups, clustering orders by seller and product, etc.) to understand how the data segments within that window
Machine learning on the historical, completed data (recommendations, forecasts, predictions, etc.)
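To make the clustering idea concrete, here is a minimal sketch of k-means on the order amounts seen in one window. It is a pure-Python stand-in, not Spark code; the function name kmeans_1d, the starting centroids and the sample amounts are all assumptions for illustration. On a cluster you would more likely reach for a library implementation such as the one in Spark's MLlib.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of the values assigned to it."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical order amounts from one 30-minute window.
window_amounts = [20.0, 25.0, 22.0, 480.0, 510.0, 495.0]
centroids, clusters = kmeans_1d(window_amounts, centroids=[0.0, 100.0])
print(clusters)  # [[20.0, 25.0, 22.0], [480.0, 510.0, 495.0]]
```

The small and large orders end up in separate groups, which is the kind of segmentation the window-level analysis is after.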

The key tools summary
Spark

Spark Streaming – real-time querying. Load RDD data into RAM and keep it until you are done; data is cached in RAM from disk for iterative processing.
RDD (Resilient Distributed Dataset) – a read-only collection of objects spread across machines. Immutable data is always safe to share across multiple processes as well as multiple threads.
Spark SQL – schema / SQL
Machine Learning – MLlib
Graph Processing – GraphX

Kafka

At the heart of Apache Kafka sits a distributed log
The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
Data is immutable. When you read messages from Kafka, data is copied directly from the disk buffer to the network buffer (zero copy).
Data is organized in topics. Producers write data to brokers; consumers read data from brokers.
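The log idea described above is small enough to sketch in a few lines of Python. This is only a toy model of the concept, not the Kafka client API: the Log and Consumer classes and their method names are made up for illustration, and a real broker adds partitions, replication and persistence.

```python
class Log:
    """Append-only list of messages for one topic."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: add a message at the end and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset, max_messages=10):
        """Sequential read starting at `offset`."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position in the log and reads onward from there."""

    def __init__(self, log):
        self.log = log
        self.position = 0  # offset of the next message to read

    def poll(self, max_messages=10):
        batch = self.log.read_from(self.position, max_messages)
        self.position += len(batch)  # record the new position
        return batch


orders_topic = Log()
for msg in ["order-1", "order-2", "order-3"]:
    orders_topic.append(msg)

consumer = Consumer(orders_topic)
first = consumer.poll(max_messages=2)
orders_topic.append("order-4")
second = consumer.poll()
print(first, second)  # ['order-1', 'order-2'] ['order-3', 'order-4']
```

Because messages are never modified, a second consumer can start at offset 0 and read the same data without disturbing the first one.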

HBASE

Low latency, consistent, best suited for random read/write access to big data
NoSQL database hosted on top of HDFS. Column-oriented (column-family) store
HBase uses HDFS as its data storage layer, which takes care of fault tolerance and scalability

Hive

Targeted for Analytics
Hive is the natural choice for SQL developers
ETL + DW (data summarization, query and analysis)

Pig

Scripting language for analytics queries

Considerations for RDBMS Vs NOSQL

Performance - Latency tolerance: how slow can my queries run on huge data sets?
Durability - Data-loss tolerance: how many in-memory or in-flight transactions can be lost when the database crashes?
Consistency - Tolerance for weird results (dirty data)
Availability - Downtime tolerance

The lambda architecture is the reference for the fast and batch layers. With machine learning and more tools evolving, it helps to think of the data story end to end and fit in the tools you need.
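As a rough sketch of the fast and batch layers: the batch layer periodically recomputes a view over all historical data, the speed layer covers only what has arrived since, and a query merges the two. The function names, the dictionary-based orders and the per-seller count are assumptions chosen for illustration.

```python
from collections import Counter


def batch_view(historical_orders):
    """Batch layer: recomputed periodically over all completed orders."""
    return Counter(o["seller"] for o in historical_orders)


def speed_view(recent_orders):
    """Speed layer: covers only orders not yet absorbed by the batch layer."""
    return Counter(o["seller"] for o in recent_orders)


def query(batch, speed):
    """Serving step: merge both views to answer 'orders per seller'."""
    return batch + speed


# Hypothetical sample data.
historical_sample = [{"seller": "a"}, {"seller": "a"}, {"seller": "b"}]
recent_sample = [{"seller": "b"}, {"seller": "c"}]

totals = query(batch_view(historical_sample), speed_view(recent_sample))
print(dict(totals))  # {'a': 2, 'b': 2, 'c': 1}
```

In the data story above, Kafka and Spark would feed the speed view, while HDFS and Hive would back the batch view.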

The tool categories remain the same, but each cloud provider maps them to different products.

Data is the same, but we have progressed to querying data in motion. Tools evolve, but your data story remains the same. Step back from the tools; let's build a data story and connect the dots.

Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard

More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of products and cloud providers is impressive. Reusing from link, Post

Architecture

BI Architecture

Data Processing

AI Architecture

Time to write your own Data Story!!!

October 05, 2018

Deep Dive PySpark Examples - Big Data Setup - Part II

After experimenting a bit with PySpark, I feel it is much better to handle things with R / Python. Most of the things we can achieve are repeated across R / Python / Spark / SQL.

Data Pipeline tasks at DB Level
One-hot encoding can also be done with basic T-SQL code
While working in NLP it makes sense to use TF-IDF Vectorizers
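Both techniques mentioned above are easy to sketch in plain Python. The function names and sample inputs are made up for illustration, and the TF-IDF formula used here (tf = count / document length, idf = log(N / document frequency)) is one common unsmoothed variant; library vectorizers typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def one_hot(values):
    """Map each value to a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def tf_idf(documents):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

scores = tf_idf(["spark is fast", "kafka is a log"])
print(scores[0])   # 'is' scores 0.0 because it appears in every document
```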

Happy Learning!!!

September 28, 2014

Spark Overview

I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked further into it. The session by Jyotiska NK on Python + Spark: Lightning Fast Cluster Computing was a useful starter on Spark. (slides of talk)

Spark

In-memory cluster computing framework for large-scale data processing
Developed in Scala, with Java and Python APIs
It is not meant to replace Hadoop; it can sit on top of Hadoop
Spark Summit slides / videos from past events are a good reference to learn from - link

Python Offerings

PySpark, Data pipeline using spark
Spark for real time / batch processing

Spark Vs Map Reduce Differences

This section was the highlight of the session. The way data is handled in MapReduce execution versus the Spark approach is key.

Map Reduce Approach - Load data from disk into RAM. Mapper, Shuffler and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data.
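The three stages can be sketched on a single machine with a word count, the usual first MapReduce example. This is a plain-Python illustration of the flow only; the function names are made up here, and a real Hadoop job distributes each stage across machines and writes intermediate results to disk.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


lines = ["spark and hadoop", "hadoop and hive"]
mapped = [pair for line in lines for pair in mapper(line)]
word_counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(word_counts)  # {'spark': 1, 'and': 2, 'hadoop': 2, 'hive': 1}
```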

Spark - Load data into RAM and keep it until you are done; data is cached in RAM from disk for iterative processing. If the data is too large, the rest is spilled to disk. Interactive processing of datasets without having to reload the data into memory each time. RDD (Resilient Distributed Datasets)

RDD - Read-only collection of objects across machines. If information is lost, it can still be recomputed.

RDD Operations

Transformations - map, filter, sort, flatMap
Actions - reduce, count, collect, save data to local disk. Actions trigger the actual computation and often involve disk operations
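The split between transformations and actions can be illustrated with a toy class. MiniRDD below is a hypothetical stand-in written for this post, not the PySpark API: transformations only record a step and return a new read-only object, while actions replay the recorded steps from the source, which is also how a lost result can be recomputed.

```python
class MiniRDD:
    """Read-only collection that remembers how it was derived."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified after creation
        self._steps = tuple(steps)    # recorded transformations, not yet run

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        """Replay every recorded step, starting from the source data."""
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = MiniRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [4, 16, 36, 64, 100]
print(even_squares.count())    # 5
```

Note that numbers itself is unchanged after the calls, and each action recomputes from the source rather than relying on a stored result.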

Read Quote of Ramzi Alqrainy's answer to What are use cases for spark vs hadoop? on Quora

More Reads
Testing Spark Best Practices
Gatling - Open Source Perf Test Framework
Spark Paper

Happy Learning!!!
