Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Pyspark

April 02, 2019

Day #232 - Kafka + Spark Integration - Big Data Setup - Part I

October 05, 2018

Deep Dive PySpark Examples - Big Data Setup - Part II

After experimenting a bit with PySpark, I feel it is much better to handle this kind of work with R / Python. Most of the things we can achieve are repeated across R / Python / Spark / SQL.

Data pipeline tasks at the DB level
One-hot encoding can also be done with basic T-SQL code
When working in NLP, it makes sense to use TF-IDF vectorizers
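
As a rough illustration of the two techniques named above, here is a minimal pure-Python sketch of one-hot encoding and TF-IDF weighting. The function names (`one_hot`, `tf_idf`) and the sample data are made up for this example. The TF-IDF variant shown (count divided by document length, times log(N / df)) is only one of several common formulas; library vectorizers typically add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter


def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {t: (count / len(tokens)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return weights


# Hypothetical sample data, for illustration only.
categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)
print(encoded)

scores = tf_idf(["spark is fast", "hadoop is reliable", "spark runs on hadoop"])
print(scores[0])
```

A term that appears in fewer documents ("fast") ends up with a higher weight than one that appears in many ("is"), which is the whole point of the IDF part.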

Happy Learning!!!

September 28, 2014

Spark Overview

I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked into Spark any further. The session by Jyotiska NK, Python + Spark: Lightning Fast Cluster Computing, was a useful starter on Spark. (slides of talk)

Spark

In-memory cluster computing framework for large-scale data processing
Developed in Scala, with Java and Python APIs
It is not meant to replace Hadoop; it can sit on top of Hadoop
References on Spark Summit for slides / videos to learn from past events - link

Python Offerings

PySpark, Data pipeline using spark
Spark for real time / batch processing

Spark Vs Map Reduce Differences

This section was the highlight of the session. The way data is handled in MapReduce execution versus the Spark approach is the key difference.

MapReduce approach - Load data from disk into RAM; Mapper, Shuffler and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data.
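
The three stages can be sketched with a toy word count in plain Python. This is a single-process illustration only, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`) and the input lines are assumptions made for the example, and a real job would run each stage across many machines.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle stage: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce stage: collapse the grouped values into one result per key."""
    return key, sum(values)


# Hypothetical input, for illustration only.
lines = ["spark and hadoop", "spark on hadoop", "spark"]
mapped = [pair for line in lines for pair in mapper(line)]
word_counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```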

Spark - Load data into RAM and keep it there until you are done. Data is cached in RAM from disk for iterative processing. If the data is too large, the rest is spilled to disk. This allows interactive processing of datasets without having to reload the data from disk each time. The core abstraction is the RDD (Resilient Distributed Dataset).
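
To see why keeping data in RAM matters for iterative jobs, here is a small pure-Python sketch that counts how often the "disk" is read with and without caching. `CountingSource` and `run_iterations` are hypothetical names invented for this example; in PySpark the comparable step would be calling `cache()` or `persist()` on an RDD before iterating over it.

```python
class CountingSource:
    """Stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self._rows)


def run_iterations(source, n_iterations, cache):
    """Run a toy iterative job, optionally keeping the data in memory."""
    cached = source.load() if cache else None
    total = 0.0
    for step in range(n_iterations):
        # Without caching, every iteration goes back to the source.
        data = cached if cache else source.load()
        total += sum(x * (step + 1) for x in data)
    return total


uncached_source = CountingSource([1.0, 2.0, 3.0, 4.0])
cached_source = CountingSource([1.0, 2.0, 3.0, 4.0])

total_uncached = run_iterations(uncached_source, 5, cache=False)
total_cached = run_iterations(cached_source, 5, cache=True)

print(uncached_source.loads, cached_source.loads)
print(total_uncached == total_cached)
```

Both runs produce the same result; the only difference is the number of reads, which is where the time goes on a real cluster.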

RDD - Read-only collection of objects partitioned across machines. If part of it is lost, it can be recomputed from the steps that produced it.
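
The recompute idea can be sketched with a toy class that remembers its parent dataset and the function that derived it. `ToyRDD` and its methods are hypothetical and exist only for this illustration; unlike real Spark, this toy computes `map` eagerly and runs in one process.

```python
class ToyRDD:
    """Read-only dataset that remembers how it was derived (its lineage)."""

    def __init__(self, partitions, parent=None, func=None):
        # Tuples keep the stored elements read-only.
        self._partitions = [tuple(p) for p in partitions]
        self._parent = parent
        self._func = func

    def map(self, func):
        """Build a new ToyRDD; the original is never modified."""
        new_parts = [[func(x) for x in part] for part in self._partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def lose_partition(self, index):
        """Simulate a machine failure by dropping one partition."""
        self._partitions[index] = None

    def get_partition(self, index):
        """Return a partition, recomputing it from the parent if it was lost."""
        if self._partitions[index] is None:
            if self._parent is None:
                raise RuntimeError("source partition lost; nothing to recompute from")
            parent_part = self._parent.get_partition(index)
            self._partitions[index] = tuple(self._func(x) for x in parent_part)
        return self._partitions[index]


base = ToyRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)

squared.lose_partition(1)             # pretend the machine holding it died
recovered = squared.get_partition(1)  # rebuilt from base, not from a replica
print(recovered)
```

The contrast with the MapReduce approach is that nothing was replicated: the lost piece is rebuilt by replaying the recorded step on the parent data.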

RDD Operations

Transformations - map, filter, sort, flatMap
Actions - reduce, count, collect, save data to disk. Actions trigger the actual computation, which may involve disk operations
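
The split between the two kinds of operation can be sketched in plain Python: transformations only record what should happen, and nothing runs until an action asks for a result. `LazyPipeline` is a hypothetical toy class written for this post, not the Spark API. On a PySpark RDD the corresponding methods are `map`, `filter`, `flatMap` and `sortBy` (transformations) and `reduce`, `count`, `collect` and `saveAsTextFile` (actions).

```python
import functools
import itertools


class LazyPipeline:
    """Records transformations and only runs them when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new pipeline, do no work yet.
    def map(self, func):
        return LazyPipeline(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def flat_map(self, func):
        return LazyPipeline(self._source, self._steps + (("flat_map", func),))

    def _run(self):
        items = iter(self._source)
        for kind, func in self._steps:
            if kind == "map":
                items = map(func, items)
            elif kind == "filter":
                items = filter(func, items)
            else:  # flat_map: each input element yields several outputs
                items = itertools.chain.from_iterable(map(func, items))
        return items

    # Actions: actually execute the recorded steps.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, func):
        return functools.reduce(func, self._run())


calls = []


def tokenize(line):
    calls.append(line)  # lets us observe when the work really happens
    return line.split()


lines = LazyPipeline(["spark is fast", "hadoop is reliable"])
words = lines.flat_map(tokenize).filter(lambda w: w != "is").map(str.upper)

calls_before_action = len(calls)  # still zero: nothing has run yet
result = words.collect()          # the action triggers the whole chain
print(calls_before_action, result)
```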

Read Quote of Ramzi Alqrainy's answer to What are use cases for spark vs hadoop? on Quora

More Reads
Testing Spark Best Practices
Gatling - Open Source Perf Test Framework
Spark Paper

Happy Learning!!!
