Experimenting with Kafka and Spark using Pyspark
Example 1 - Kafka Publish - Consume
Example 2 - Kafka Publish - Spark Consume
Happy Learning!!!
April 02, 2019
October 05, 2018
Deep Dive PySpark Examples - Big Data Setup - Part II
After experimenting a bit with PySpark, I feel it is much better to handle these tasks with R / Python. Most of the things we can achieve are repeated across R / Python / Spark / SQL.
- Data pipeline tasks can be handled at the DB level
- One-hot encoding can also be done with basic T-SQL code
- While working in NLP, it makes sense to use TF-IDF vectorizers
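To make the two techniques named above concrete, here is a minimal plain-Python sketch of one-hot encoding and TF-IDF weighting. The function names `one_hot` and `tf_idf` and the sample inputs are my own, for illustration only. The TF-IDF here is the unsmoothed textbook formula, so library vectorizers that apply smoothing or normalization will give different numbers.

```python
import math
from collections import Counter


def one_hot(values):
    """One-hot encode categorical values; columns follow sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def tf_idf(docs):
    """Unsmoothed TF-IDF: tf = count / doc length, idf = log(N / doc frequency)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample data.
categories, rows = one_hot(["red", "green", "red"])
weights = tf_idf(["spark is fast", "kafka is durable"])
```

A term that appears in every document (here "is") gets weight zero, which is the point of the IDF part: common words carry no signal.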
Happy Learning!!!
Labels:
Big Data,
Big Data Setup,
Data Science,
Data Science Tips,
Pyspark
September 28, 2014
Spark Overview
I remember the Spark keyword coming up during Big Data architecture discussions in my team, but I never looked further into Spark. The session by Jyotiska NK on Python + Spark: Lightning Fast Cluster Computing was a useful starter on Spark. (slides of talk)
Spark
- In-memory cluster computing framework for large-scale data processing
- Developed in Scala, with Java and Python APIs
- It is not meant to replace Hadoop; it can sit on top of Hadoop
- References on Spark Summit for Slides / Videos to learn from past events - link
Python Offerings
- PySpark, Data pipeline using spark
- Spark for real time / batch processing
Spark Vs Map Reduce Differences
This section was the highlight of the session. The way data is handled in Map Reduce execution versus the Spark approach is key.
Map Reduce Approach - Load data from disk into RAM; Mapper, Shuffler, and Reducer are the different stages. Processing is distributed. Fault tolerance is achieved by replicating data.
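The Mapper, Shuffler, Reducer stages can be sketched on a single machine with a word count. This is plain Python, not the Hadoop API; the function names and the sample lines are my own assumptions, and a real job would run the stages on different nodes with disk reads and writes in between.

```python
from collections import defaultdict
from functools import reduce


def mapper(line):
    """Map stage: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle stage: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce stage: combine the grouped values for one key."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input lines.
counts = word_count(["spark and hadoop", "spark on hadoop", "spark"])
```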
Spark - Load data into RAM and keep it there until you are done. Data is cached in RAM from disk for iterative processing; if the data is too large, the rest is spilled to disk. This allows interactive processing of datasets without having to reload the data into memory each time. The core abstraction is the RDD (Resilient Distributed Dataset).
RDD - Read-only collection of objects partitioned across machines. If part of it is lost, it can be recomputed from the steps that produced it.
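The recompute-on-loss idea can be illustrated with a toy class. This is not Spark code: `ToyRDD`, `lose_cache`, and the sample data are hypothetical names I made up, and the sketch ignores partitioning entirely. It only shows that a read-only dataset which remembers its source and its transformation steps can rebuild a lost in-memory copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how to rebuild itself."""

    def __init__(self, source, steps=()):
        self._source = source        # durable input, e.g. data on disk
        self._steps = tuple(steps)   # recorded transformations (the lineage)
        self._cache = None           # in-memory copy, may be lost

    def map(self, fn):
        # Read-only: a transformation returns a new object, never mutates.
        return ToyRDD(self._source, self._steps + (fn,))

    def _compute(self):
        data = list(self._source)
        for fn in self._steps:
            data = [fn(x) for x in data]
        return data

    def collect(self):
        if self._cache is None:      # missing copy is rebuilt from lineage
            self._cache = self._compute()
        return list(self._cache)

    def lose_cache(self):
        """Simulate a node failure that drops the in-memory copy."""
        self._cache = None


squares = ToyRDD([1, 2, 3]).map(lambda x: x * x)
first = squares.collect()
squares.lose_cache()
second = squares.collect()           # recomputed, same result as before
```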
RDD Operations
- Transformations - map, filter, sort, flatMap
- Actions - reduce, count, collect, save data to local disk. Actions trigger the actual computation and usually involve disk operations
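The split between transformations and actions can be mimicked with Python generators. The `LazyDataset` class below is a hypothetical analogy of my own, not the PySpark API: transformations only describe work, and nothing is evaluated until an action such as `collect` or `count` is called.

```python
from functools import reduce as fold
from itertools import chain


class LazyDataset:
    """Plain-Python analogy: transformations are lazy, actions run them."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # zero-argument callable, fresh iterator

    # Transformations: return a new dataset, evaluate nothing.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._make_iter()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    def flat_map(self, fn):
        return LazyDataset(
            lambda: chain.from_iterable(fn(x) for x in self._make_iter())
        )

    # Actions: walk the whole pipeline and produce a result.
    def collect(self):
        return list(self._make_iter())

    def count(self):
        return sum(1 for _ in self._make_iter())

    def reduce(self, fn):
        return fold(fn, self._make_iter())


seen = []  # records which elements were actually evaluated


def record(x):
    seen.append(x)
    return x


lines = LazyDataset(lambda: iter(["a b", "c", "d e f"]))
words = lines.flat_map(str.split).map(record).filter(lambda w: w != "c")
evaluated_before_action = list(seen)  # still empty: nothing has run yet
result = words.collect()              # the action triggers evaluation
```

Note that each action re-runs the pipeline from the source, which is why caching matters for iterative work.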
More Reads
Testing Spark Best Practices
Gatling - Open Source Perf Test Framework
Spark Paper
Happy Learning!!!
Labels:
Big Data,
Big Data Setup,
Pycon2014,
Pyspark,
Spark

