"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 31, 2019

Day #230 - What is your Data Story? - Big Data Setup - Part III

Tools evolve, and patterns and architectures vary as we progress. To build your data story, you need to look at data from a few perspectives.

The story of Data in Motion
  • Streaming data (Incoming data)
  • Passing data (Data within a certain time interval)
  • Transactional data (Current data in operation)
  • Historical data (Transaction completed)
The most famous example is the e-commerce segment. The data story evolves as
  • Streaming data (Placing orders) - Kafka
  • Passing data (Checking orders received in the past 30-minute window) – Spark
  • Transactional data (Orders placed) - HBase / NoSQL / any RDBMS based on business need
  • Historical data (Completed orders) - Move completed orders to HDFS and build Hive tables for further analysis
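The "passing data" stage above can be sketched in plain Python as a time-window filter. This is only an illustration of the idea, not Spark code: the `Order` record and the `orders_in_window` helper are hypothetical names, and a real pipeline would read events from Kafka and window them in Spark.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Order:
    order_id: str
    placed_at: datetime


def orders_in_window(orders, now, window=timedelta(minutes=30)):
    """Return the orders placed within the last `window` before `now`."""
    start = now - window
    return [o for o in orders if start < o.placed_at <= now]


now = datetime(2019, 3, 31, 12, 0)
orders = [
    Order("A", now - timedelta(minutes=45)),  # outside the window
    Order("B", now - timedelta(minutes=20)),  # inside the window
    Order("C", now - timedelta(minutes=5)),   # inside the window
]
recent = orders_in_window(orders, now)
print([o.order_id for o in recent])
```

Orders that fall out of the window would then move on to the transactional and historical stores.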
Data Science Role in the Story
  • Real-time Machine Learning on Spark (30-minute interval data, clustering sales orders into similar groups, clustering orders by seller and product, etc.) to understand the segmentation of data in that window
  • Perform Machine Learning on the Historical Completed Data (Recommendations, Forecast, Predictions etc.)
The key tools summary
Spark
  • Spark Streaming – Real-time querying, Load RDD data in RAM, keep it until you are done, Data is cached in RAM from disk for iterative processing. 
  • RDD (Resilient distributed datasets). RDD - Read Only collection of objects across machines
  • Spark SQL – Schema / SQL
  • Immutable data is always safe to share across multiple processes as well as multiple threads
  • Machine Learning – MLlib
  • Graph Processing – GraphX
Kafka
  • At the heart of Apache Kafka sits a distributed log
  • The log-structured approach is itself a simple idea: a collection of messages, appended sequentially to a file.
  • When a service wants to read messages from Kafka it ‘seeks’ to the position of the last message it read, then scans sequentially, reading messages in order, while periodically recording its new position in the log.
  • Data is immutable. When you read messages from Kafka, Data is copied directly from the disk buffer to the network buffer
  • Data organized in topics. Producers write data to brokers, Consumers read data from brokers
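The log-and-offset idea described above can be sketched with a small in-memory model. This is not the Kafka client API; `TopicLog` and `Consumer` are hypothetical classes that only mimic an append-only log and a consumer that remembers its position.

```python
class TopicLog:
    """Append-only list of messages; existing entries are never modified."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position (offset) in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read_from(self.offset, max_messages)
        self.offset += len(batch)  # record the new position
        return batch


log = TopicLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

consumer = Consumer(log)
first = consumer.poll(max_messages=2)
second = consumer.poll(max_messages=2)
print(first, second, consumer.offset)
```

Because the log is never rewritten, a second consumer can start at offset 0 and read the same messages independently.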
HBASE
  • Low Latency, Consistent, best suited for random read/write big data access
  • NoSQL database hosted on top of HDFS. Column-oriented (wide-column) store
  • HBase uses HDFS as its data storage layer, this takes care of fault tolerance, scalability aspects
Hive
  • Targeted for Analytics
  • Natural choice for SQL Developers is Hive 
  • ETL + DW (data summarization, query and analysis)
Pig
  • Scripting language for analytics queries
Considerations for RDBMS Vs NOSQL
  • Performance - Latency tolerance: how slowly queries may run on huge data sets
  • Durability - Data loss tolerance: how many in-memory or in-flight transactions may be lost when the database crashes
  • Consistency - Weird results tolerance (dirty data tolerance)
  • Availability - Downtime tolerance
The lambda architecture is the reference for the fast and batch layers. With machine learning and more tools evolving, it helps to think of the data story end to end and fit in the tools you need.


The tools remain the same but mapping is different across different cloud providers 



The data is the same, but we have progressed to querying data in motion. Tools evolve, but your data story remains the same. Step back from the tools, build a data story, and connect the dots.

Old Process – Model – Collect - Analyze Data
New Process – Collect – Analyze Data in Motion – Build Model (Paradigm Shift)

My Whiteboard


More Reads
Pattern: Database per service
The Hardest Part About Microservices: Your Data

Update - Oct 18th 2020

The new wave of current products, cloud providers is impressive. Reusing from link, Post

Architecture



BI Architecture



Data Processing



AI Architecture


Time to write your own Data Story!!!

March 29, 2019

Day #229 - Running Pytorch Model in OpenVino

Step 1 - ONNX Pre-Requisites Install


Step 2 - Save Pytorch Model in ONNX Format

#Customize and Run this model
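A minimal sketch of what this step could look like, assuming a plain `torch.nn.Module` and the standard `torch.onnx.export` call. `TinyNet`, `export_to_onnx`, and the 224 × 224 input shape are placeholders for illustration, not the model used in this post.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Placeholder model; swap in your own trained network."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)


def export_to_onnx(model, path, input_shape=(1, 3, 224, 224)):
    """Trace the model with a dummy input and write an ONNX file."""
    model.eval()
    dummy_input = torch.randn(*input_shape)
    torch.onnx.export(model, dummy_input, path)


model = TinyNet()
with torch.no_grad():
    output = model(torch.randn(1, 3, 224, 224))
print(output.shape)
```

Calling `export_to_onnx(model, "tinynet.onnx")` (a hypothetical file name) would write the file that is then passed to `mo.py --input_model` in the next step.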

Step 3 - Goto Model Optimizer Directory
sudo python3 mo.py --input_model /home/ubuntu/code/resnet51mid.onnx
python3 mo.py --input_model <INPUT_MODEL>.onnx



This generates the .xml file (along with a .bin file) needed to run the model with OpenVINO.

Step 4 - Custom Model Training


Step 5 - Custom Model Export


./pedestrian_tracker_demo -i /home/ubuntu/code/smarthub_915am_cut.mp4.mp4 -m_det /opt/intel/computer_vision_sdk/deployment_tools/intel_models/person-detection-retail-0013/FP32/person-detection-retail-0013.xml -m_reid /opt/intel/computer_vision_sdk/deployment_tools/model_optimizer/resnet51mid.xml -d_det CPU

Happy Mastering DL!!!

March 28, 2019

Some thoughts on 'Data Use'

Honorable use of Data
  • Using Sales data for Overall recommendations 
  • Provide Options to Delete Historical Sales Data
  • Clarity on Data Ownership on Personal Information
Privacy Concerns
  • Tracking Personal Contacts
  • Tracking Location and Other Activity
  • Ethics, Empathy before Click Impression Recommendations
Happy Thinking!!!

Day #228 - OpenVino and pedestrian_tracker_demo

1. Download Demo Projects from link
2. Follow instructions in link to build demos
3. Execute commands

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release <path_to_inference_engine_demos_directory>


make (Please check below)



4. pedestrian_tracker_demo - Documentation 

How it works
Step 1 -  Primary detection network for finding pedestrians
Step 2 -  A second network takes the detections from the first network and re-identifies the pedestrians
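The two steps can be sketched as follows. This is not the demo's actual implementation: the `ReIdTracker` class, the cosine-similarity matching rule, and the 0.9 threshold are assumptions for illustration, and the embeddings are made-up vectors standing in for re-identification network outputs.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ReIdTracker:
    """Assigns a stable ID to each detection by comparing embeddings."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.known = {}  # person_id -> reference embedding
        self._next_id = 0

    def assign(self, embedding):
        best_id, best_score = None, self.threshold
        for person_id, reference in self.known.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_id, best_score = person_id, score
        if best_id is None:  # nobody similar enough: treat as a new person
            best_id = self._next_id
            self._next_id += 1
            self.known[best_id] = embedding
        return best_id


tracker = ReIdTracker()
# Hypothetical embeddings, as if produced by the re-id network for each
# bounding box returned by the detection network.
frame1 = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
frame2 = [[0.0, 0.9, 0.25], [0.98, 0.05, 0.1]]
ids_frame1 = [tracker.assign(e) for e in frame1]
ids_frame2 = [tracker.assign(e) for e in frame2]
print(ids_frame1, ids_frame2)
```

The same two people appear in both frames in a different order, and each keeps the ID assigned in the first frame.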

Command to Run pedestrian_tracker
./pedestrian_tracker_demo -i <path_video_file> \
                          -m_det <path_person-detection-retail-0013>/person-detection-retail-0013.xml \
                          -m_reid <path_person-reidentification-retail-0031>/person-reidentification-retail-0031.xml \
                          -d_det GPU

5. Actual Command
./pedestrian_tracker_demo -i /home/ubuntu/code/smarthub_915am_cut.mp4.mp4 -m_det /opt/intel/computer_vision_sdk/deployment_tools/intel_models/person-detection-retail-0013/FP32/person-detection-retail-0013.xml -m_reid /opt/intel/computer_vision_sdk/deployment_tools/intel_models/person-reidentification-retail-0031/FP32/person-reidentification-retail-0031.xml -d_det CPU

Finally, I was able to get this running. It is interesting that they run two models in real time: detection and re-identification. It would be interesting to know how these models were trained, how people re-entering the frame are counted, and what inferences are drawn from this. There must be base training data that drives detection, and re-id logic that compares each detection against similar, previously seen ones.

Not Needed
1. Goto Download Directory ./downloader.py --all

Happy Mastering DL!!!

March 27, 2019

Day #227 - Learning Re-Id

1. Installed the Pytorch framework for deep-learning person re-identification - Link
2. Followed the steps commenting out the GPU code and ran it on CPU
3. Code to Check Available Models


4. Running it on Other Datasets

5. Next Step - On Custom Dataset Setting up, Testing Models

Happy Mastering DL!!!

Day #226 - Pytorch Sessions (17-19)

Key Lessons
Class Basic Example

NN code

Multiplication

Happy Mastering DL!!!!



March 26, 2019

Day #225 - Pytorch Sessions (14-16)



Happy Mastering DL!!!

The Data Collectors

Data Collectors
  • Shopping - Amazon, Walmart
  • Search - Google
  • Email - Google
  • Social Life - Facebook
Segmentation by Domain
  • Data Around Consumption - Amazon
  • Intent - Google
  • Content Consumption - Facebook
  • Healthcare, Education & Adaptive Learning, Financial Regulation, Credit - No Winner
Rough Map of a 'Typical Social Network'

Data Collectors

Data Intents
  • Collect Moments
  • Connect Relationships
End Customers - Data Segmentation
  • Organizations
  • Kids, Teens
  • Working Professionals
  • Family
  • Friends
Honorable use of Data
  • Recommendations
  • Options to Delete
  • Data Ownership
Privacy Concerns
  • Tracking Personal Contacts
  • Tracking Location and Other Activity
  • Ethics
  • Empathy
Collect data and use it in a honorable way!!!

Day #224 - OpenVino on Ubuntu

Part I - Setting up on Ubuntu for ML work - Link
Base Link - Steps

Few Changes



source /opt/intel/computer_vision_sdk/bin/setupvars.sh
nano ~/.bashrc
source /opt/intel/computer_vision_sdk/bin/setupvars.sh



pip install protobuf

Demo Output


Happy Learning!!!

March 22, 2019

Day #221 - Session 5-8 - Pytorch

Key Lessons
Session #4
  • Tensor Attributes
  • Inputs / Outputs represented by tensors
  • Scalar, Vector, Matrix (Used in maths)
  • Number (Zero Index), Array (One Index), 2D-Array (Two Index)
  • Tensors are multi-dimensional arrays
  • Tensor Attributes, Rank, Axes, Shape
Session #5 
  • Rank - Refers to the number of dimensions present in tensor, How many indices required to access an element
  • Shape - Length of each axis; the shape allows visualizing the tensor
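Rank and shape can be illustrated with plain nested lists, without any framework. The helpers `shape_of` and `rank_of` are hypothetical names, and the sketch assumes regular (non-ragged) nesting.

```python
def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    current = nested
    while isinstance(current, list):
        shape.append(len(current))
        current = current[0] if current else None
    return tuple(shape)


def rank_of(nested):
    """Rank = number of axes = number of indices needed per element."""
    return len(shape_of(nested))


scalar = 7
vector = [1, 2, 3]
matrix = [[1, 2, 3], [4, 5, 6]]
print(rank_of(scalar), shape_of(scalar))  # 0 ()
print(rank_of(vector), shape_of(vector))  # 1 (3,)
print(rank_of(matrix), shape_of(matrix))  # 2 (2, 3)
print(matrix[1][2])                       # two indices reach one element: 6
```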
Session #6
  • Image Input - to CNN as tensor
  • Axes from Right to left
  • [? A0,? A1,? A2,? A3] 
  • Height / Width on last axes
  • [?,?,H,W]
  • Color Channels
  • [?,C,H,W]
  • Three Indexes for Color Channel
  • First Axes - Batch Size (Batches of samples)
  • [3,1,28,28]
  • Batch of 3 images
  • Single Color Channel
  • Height, Width 28,28
  • Three channel output for three layers
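The [batch, channels, height, width] layout from the notes above, as a short PyTorch sketch. The zero-filled tensor is a stand-in for real image data.

```python
import torch

batch = torch.zeros(3, 1, 28, 28)  # [batch, channels, height, width]

print(batch.shape)       # torch.Size([3, 1, 28, 28])
print(len(batch.shape))  # rank 4

first_image = batch[0]          # shape [1, 28, 28]: one image, one channel
first_channel = batch[0][0]     # shape [28, 28]: height x width
one_pixel = batch[0][0][5][7]   # four indices reach a single value
```

Each index peels off one axis from the left, so the batch axis comes first and width comes last.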
Session #7
  • Data to Tensors
  • torch.tensor class
  • Top level torch package
  • Tensors contain data of uniform type
  • Computation depends on data type and device
  • Operands of a computation must have the same data type and be on the same device
  • matching CPU and GPU versions
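A short sketch of the dtype and device points above, assuming a standard PyTorch install. It falls back to the CPU when no CUDA device is present. Newer PyTorch releases promote mixed dtypes automatically for many operations, so the explicit conversion here is a habit rather than a strict requirement.

```python
import torch

ints = torch.tensor([1, 2, 3])          # dtype inferred as torch.int64
floats = torch.tensor([1.0, 2.0, 3.0])  # dtype inferred as torch.float32

print(ints.dtype, ints.device)
print(floats.dtype, floats.device)

# Convert explicitly so both operands share a dtype before combining them.
total = ints.to(torch.float32) + floats
print(total)

# Move data to the GPU only when one is available; both operands of an
# operation must live on the same device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = floats.to(device)
b = ints.to(device=device, dtype=torch.float32)
print((a + b).device)
```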



Happy Mastering DL!!!

March 21, 2019

Day #220 - Pytorch (1-4 - 10 Mins Sessions)

Key Summary
Session #1
  • Tensors - Data Structures of Deep Learning
Session #2
  • DL framework and scientific computing package
  • tensor nd array
  • numpy goto package for nd array
  • Interoperable with numpy
  • Torch based on Lua Programming Language
  • Now maintained by Facebook
Deep Learning Features of Pytorch
  • Pytorch packages
  • torch - Top level package
  • torch.nn - Build neural networks
  • torch.autograd - Differentiable tensor operations - Derivative Calculations
  • torch.nn.functional - loss, activation and convolution functions
  • torch.optim - Optimization operations like SGD and Adam
  • torchvision - image transformations of computer vision
  • torch.utils - Dealing with Datasets
  • Preferred framework for research
  • For computing derivatives, computation graph
  • pytorch uses dynamic computation graph
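A minimal `torch.autograd` example of the derivative calculation and dynamic graph mentioned above. The function y = x² + 3x is an arbitrary choice for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# The graph is built as these operations run (define-by-run).
y = x ** 2 + 3 * x

y.backward()          # compute dy/dx
print(y.item())       # 10.0
print(x.grad.item())  # 2*x + 3 = 7.0
```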
Session #3
  • Anaconda python package manager
  • conda install pytorch -c pytorch
  • pip install torchvision
  • cuda 9.0 is stable
  • install pytorch and torchvision
  • conda list pytorch
  • Data + Knowledge + Architecture = Enormous Software
Session #4
  • GPU - Graphics Processing Unit
  • Good at handling Specialized Computing
  • Parallel Computing
  • Smaller computation carried out simultaneously
  • NN are embarrassingly Parallel
  • Optimize across the entire stack
  • Cuda toolkit
  • GPU + Libraries
  • Supported nvidia gpu
  • Much of Pytorch written in Python
  • Move critical functions to c/c++
  • pytorch supports multiple gpus
  • GPU (CUDA - cuDNN) - Pytorch on top of Cuda
  • Technology built on top of layers
  • Paper (GPGPU Computing)


Happy Mastering DL!!!

March 17, 2019

Predpol - Predictive Policing - Security, Law

Today I checked on Predpol - Predictive analytics to prevent crime. A little assessment from my end on the Data Source - Historical + Real-time and how they can alert proactively.

Historical Data sources
  • Helpline Call Report Records
  • Crime History by Day
  • Crime History by Month / Week / Seasonality / Festivals / Holidays
  • Crime Patterns by Area
  • Data Collected from Crowd Patterns
  • Types of Violence vs Areas (Mob, Bars, Vehicle Accidents)
  • Crime Activity vs Time
  • Type of Crime vs Day Analysis
  • Type of Crime vs Area Analysis
Real-time Assessment
  • Video Cameras to feed on Vehicle, People Movements
  • Crowd Movements in Late Night
  • Alert on Suspicious Movements / Weapons Detection
  • Weather Activity Factor
Digital / Telecom / Social Media Real-time Inputs
  • Monitoring Criminal Network Activity 
  • Network Vehicle Movements
  • Cell Phone Signal Movements
Intelligence Inputs
  • Robbery
  • Violence
  • Floating Population Impact for Meetings / Travel
Game Theory
  • Map the pattern of Time / Area
  • Arrive at optimal Randomized Strategy
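One simple way to randomize patrols over a time/area map is to sample areas in proportion to a risk score. This is a plain risk-weighted randomization, not a solved game-theoretic equilibrium, and the area names and scores are made up for illustration.

```python
import random


def patrol_probabilities(risk_scores):
    """Normalize per-area risk scores into a probability distribution."""
    total = sum(risk_scores.values())
    return {area: score / total for area, score in risk_scores.items()}


def sample_patrol_plan(risk_scores, shifts, rng):
    """Pick one area per shift at random, weighted by risk."""
    areas = list(risk_scores)
    weights = [risk_scores[a] for a in areas]
    return rng.choices(areas, weights=weights, k=shifts)


# Hypothetical risk scores for one time slot (e.g. from historical counts).
risk = {"market": 6.0, "station": 3.0, "park": 1.0}
probs = patrol_probabilities(risk)
plan = sample_patrol_plan(risk, shifts=5, rng=random.Random(0))
print(probs)
print(plan)
```

Randomizing keeps the schedule unpredictable while still sending more patrols to higher-risk areas on average.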
You would need different models for different types of crime prediction, the recommendation of patrols based on both historical and real-time movements. 

More Reads
https://www.predpol.com/technology/

Security Related Notes
Digital Evidence Reads
Types of Digital Evidence
  • Communication websites, particularly message boards and chat rooms
  • File-sharing, pictures, and video
  • Downloaded files can often be linked to specific IP addresses
  • User’s web activity, Global Positioning System (GPS) location, and text messaging distinct from email or other documents
  • Contact list, a log of calls made and received, and call duration
  • Physical extraction of a hard drive
  • Social media, cloud storage systems, or private CCTV installations
National and Transnational Security Implications of Big Data in the Life Sciences 
ML Areas
  • Facial recognition
  • Fingerprint identification
  • DNA matching
Happy Mastering DL!!!

March 14, 2019

Experience != Expertise

Software Engineer - Learning to code
Senior Software Engineer - Learning from failures, Coding for a few years
Lead Engineer - Failed in experimenting with different prototypes for a decade
Principal Engineer - Has mastered failure and become a master experimenter; learns to look at code from a failure perspective; has seen enough failures to guide the rest of the team along the safest path

Working for years does not build expertise until you experiment and learn from failure.

Keep Learning!!!

March 13, 2019

Day #219 - Person Re-Identification

All the research papers with code available in site paperswithcode 
Reid Resources 

Overall Lessons
  • CNN based auto encoders to encode an input image, and then using K-nearest neighbor algo, find the closest match to the encoded images in a database
  • Query2Gallery Similarity using Euclidean distance
  • Foreground, Head, Upper Body, Lower Body used for Cues
  • Detection + Classification
  • Local Maximal Occurrence (LOMO) analyzes the horizontal occurrence of local features, and maximizes the occurrence to make a stable representation against viewpoint changes
  • Video tracklets in person re-identification
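Query-to-gallery matching with Euclidean distance, as listed above, can be sketched in a few lines. The embeddings and image IDs are hypothetical; in practice the vectors come from the encoder network.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_gallery(query, gallery, top_k=3):
    """Return the top_k (image_id, distance) pairs closest to the query."""
    scored = [(image_id, euclidean(query, emb)) for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Hypothetical embeddings for four gallery images.
gallery = {
    "cam1_person_a": [0.9, 0.1, 0.0],
    "cam2_person_a": [0.8, 0.2, 0.1],
    "cam2_person_b": [0.1, 0.9, 0.3],
    "cam3_person_c": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
matches = rank_gallery(query, gallery, top_k=2)
print(matches)
```

This is a brute-force nearest-neighbor search; large galleries would normally use an index instead of scanning every entry.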
Key Lessons
Talk #1 - Human Semantic Parsing for Person Re-identification
  • Query Image
  • Retrieve all images of the same identity
  • Query, Top 10 Retrieved Matches
Challenges
  • Illumination Condition
  • Background Clutter
  • Occlusion
  • Observable Body parts not visible
  • Hard to obtain posture
  • Extracting robust visual representation
  • Low-Resolution Images
Questions
  • Develop complex models?
  • Extract Local Visual Cues?
  • Human Pose Estimation used to estimate 
  • Unable to identify arbitrary contours of body parts
  • Methods of Horizontal stripes
Contributions
  • Human Semantic Parsing (SPReid)
  • Simple holistic models work
SPReid
  • Inception-V3 architecture
  • Modified Inception-V3 architecture
  • Dilated Convolution


Architecture
  • Image - Inception V3
  • Avg Pooling get final representation
  • Foreground, Head, Upper Body, Lower Body used for Cues
  • One Global
  • One Foreground
Training and Evaluation
  • Softmax cross entropy loss
  • Train on low resolution, fine tune on high resolution
  • Look into person (Dataset)
  • Query2Gallery Similarity using Euclidean distance


Talk #2 - Joint Detection and Identification Feature Learning for Person Search | Spotlight 2-2B

Key Lessons
  • Match Photo with Manually Crafted
  • Find from the whole image, Detect People and Extract People and Features
  • Softmax classifier
  • Detection + Classification
  • Online instance Matching
  • Labeled One's Lookup Table
  • Minimize the distance between instances of the same person


Talk #3 - Unsupervised Person Re-identification by Deep Learning Tracklet Association



Key Lessons
  • Supervised (Pairwise Neighboring)
  • Triplet Loss
  • Manually Labelled, Impose huge constraint
  • Completely unsupervised using tracklet associations
  • Collect Tracklet Data
  • Tracklet Sampling
  • Tracklet Association
  • Histogram Loss, Surrogate Loss








Siamese Network
Key Lessons
  • Find similar faces
  • Sequence of CNN, Pooling and Feature vector
  • Fed to make classification
  • Number computed vector F(x1) - Encoding of input Image
  • Feed second pic and get another F(x2)
  • Encoding is a good representation; find the distance between F(x1) and F(x2)
  • Two CNN and comparing them is Siamese Network Architecture
  • Train NN that generates encoding
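The Siamese idea above, where one encoder produces F(x1) and F(x2) and the two are compared, can be sketched without a deep learning framework. The `SharedEncoder` is a toy linear map with hand-picked weights and the 0.5 threshold is arbitrary; a trained CNN would take its place.

```python
import math


class SharedEncoder:
    """Toy linear encoder; both branches reuse the same weights."""

    def __init__(self, weights):
        self.weights = weights  # one row of weights per output feature

    def encode(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def siamese_distance(encoder, x1, x2):
    """Encode both inputs with the same encoder and compare F(x1), F(x2)."""
    f1, f2 = encoder.encode(x1), encoder.encode(x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def same_person(encoder, x1, x2, threshold=0.5):
    return siamese_distance(encoder, x1, x2) < threshold


# Hypothetical weights and inputs; a trained network would learn the weights.
encoder = SharedEncoder([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
face_a1 = [0.9, 0.1, 0.3, 0.7]
face_a2 = [0.8, 0.2, 0.9, 0.1]
face_b = [0.1, 0.9, 0.3, 0.7]
print(same_person(encoder, face_a1, face_a2))  # True
print(same_person(encoder, face_a1, face_b))   # False
```

The key design point is that both inputs pass through the same weights, so distances between encodings are comparable.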


More Reads
One Shot Learning with Siamese Networks using Keras
Image Similarity with Siamese Networks
Keras Example1
Siamese Network
Survey on Deep Learning Techniques for Person Re-Identification Task
Unsupervised Person Re-identification by Deep Learning Tracklet Association
Enhanced Deep Feature Representation for Person Re-identification
WACV18: Vehicle Re-identification by Adversarial Bi-directional LSTM Network

Survey on Deep Learning Techniques for Person Re-Identification Task
Key Notes
  • On-line applications for people/object detection and tracking
  • Recognizing a suspicious action/behavior from the camera network
  • Off-line applications to support operators and forensic investigators 
Image Challenges
  • Low image resolution
  • Unconstrained pose
  • Illumination changes
  • Occlusions 
Features to Exploit
  • Face
  • Clothing appearance
  • Gait
  • CNN generates a set of feature maps in which each pixel of given image corresponds to a specific feature representation
  • Image Size - 128 × 64
DNN Key Considerations
  • Objective function
  • Loss functions
  • Data augmentation
Feature fusion deep neural network
  • The network takes a single 224 × 224 × 3 image as input
  • Hand-crafted features are extracted by one of the standard person re-identification descriptors
  • Both extracted features are followed by a buffer layer and a fully connected layer, which act as the fusion layer
  • A softmax loss layer then takes the output vector of the fully connected layer to minimize the cross-entropy loss
Siamese network
  • Siamese network models have been widely employed in person re-identification task
  • Employed as pairwise
  • Two subnetworks included
  • Output is similarity score
Triplet models
  • Training samples are fed separately into three identical networks that share one parameter set
  • Each triplet unit is organized to maximize the margin between matched pairs and mismatched pairs
  • Loss functions: hinge loss, cosine similarity loss, contrastive loss
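The margin idea behind triplet models can be written as a hinge loss over distances. This is the common margin-based form with Euclidean distance, shown on made-up 2-D embeddings; the survey may use other variants.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # same identity, distance 1
far_negative = [3.0, 4.0]   # different identity, distance 5
near_negative = [0.0, 1.5]  # different identity, distance 1.5

print(triplet_hinge_loss(anchor, positive, far_negative))   # 0.0
print(triplet_hinge_loss(anchor, positive, near_negative))  # 0.5
```

The loss is zero once the mismatched pair is at least `margin` farther away than the matched pair, so training effort goes to the hard triplets.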

Happy Mastering DL!!!!

My Career Journey

2003 - Testing windows OS is great
2004 - Coding C++ MQ adapter is interesting
2005 - Setting up an Application support team and Swiss onsite
2006 - Finding my place in Microsoft, Learning the Domain, Supply Chain
2007 - Performance, SQL Migration, Biztalk and Automation
2008 - DB Developer
2009  - BI / OLTP performance tuning / TSQL developer
2010 -  BI / OLTP performance tuning / TSQL developer
2010 - Need more $$ and challenges
2011 - Setting up Team in Amazon
2012 - Better become Individual Contributor, Again Database and QA, Setting up Team
2013 - Performance, Automation, Database development
2014 - Big data jump and getting into it
2015 - Start from scratch Data science
2016 - Data Science year 2 All 2 year Masters deep dive
2017 - Data Science year 3 Image and Data Analytics projects
2018 - Data Science year 4 Machine Learning projects
2019 - Data Science year 5 Deep learning Projects
2020 - Vision Expert and Deploying Solutions in Scale

To sum up - "Data Guy, Empirical Learner"

I have spent more effort learning on weekends than on weekdays. I have witnessed the transformation of technology and the tools that evolved over all these years. Experimenting, working in different roles, starting from zero, and re-learning have been a rewarding experience. All these years I have learned and relearned, and I am still learning. Outside work, life also gave me lessons and blessings.

Keep learning, Keep growing!!!

March 12, 2019

Big Data and SQL Journey

I have been a silent observer of the Big Data journey.
  • In 2010 - Ran SQL queries on RDBMS for committed data
  • Entity Framework came into the picture, along with a heated debate over whether stored procedures were still needed
  • Hadoop came; I queried HBase and Hive for real-time / analytics queries on current and historical data
  • After a few years, I now see queries in Spark SQL and KSQL for streaming and passing data
  • SQL queries are shifting all the way from Database - Real-time - Real-time Stream Querying
  • SQL Skills are coming to every layer of the data processing stack
Querying now goes for - incoming data, passing data, current data, historical data
  • Kafka SQL - SQL for Incoming data
  • Spark SQL - SQL for Passing data
  • HBASE - SQL for Real-time data
  • HIVE - SQL for historical data
All Hail SQL!!!

Happy Learning!!!

March 07, 2019

Day #218 - Working on gluoncv, anaconda and Windows 10

In my base Anaconda 3+ Environment,

pip install gluoncv
pip install mxnet

Download the model from https://modelzoo.co/model/ssd-mxnet. This step turned out not to be required; I figured out later that the error was due to issues with the scipy and matplotlib packages below.

Error in Call back
pip uninstall scipy
pip install scipy
pip uninstall matplotlib
pip install matplotlib

Below is example code

Data gets downloaded to location
C:\Users\#user#\.mxnet\models

The Output example is




Happy Mastering DL!!!

March 06, 2019

Day #217 - OpenVino Session

Key Lessons
  • Open Visual Inference and Neural Network Optimization toolkit
  • Tools & Capabilities for Developers across domains
  • OpenVino is supported only for Intel Devices
  • Models for purpose
  • Models to detect across the frame
  • Hardware for Performance
  • Compute Efficiency / Memory Hierarchy / APIs
  • CPU and Integrated FPGA, GPU solutions



Pipeline flow
  • Decode compressed Image
  • Preprocessing - Scale down for DL model input, ROI computation, Frame Re-ordering
  • Post-Processing - Write bounding boxes on top of it
Inference
  • Training offline activity
  • Model Optimizer would do conversion for CPU, GPU, FPGA
  • Intel Library also added on it
  • Out of Box Models in OpenVino
  • Compile for Target
  • mo.py (Model Optimizer) with FP16: generates the .xml and .bin files
  • Movidius Neural Compute Stick
  • -d CPU, -d GPU, -d MYRIAD
  • Use of HETERO plugin - GPU and CPU
  • -d HETERO:GPU,CPU





More Reads - 
https://github.com/intel-iot-devkit/store-traffic-monitor
https://github.com/intel-iot-devkit/smart-video-workshop



Happy Mastering DL!!!

Day #216 - My Date with OpenVino

The Steps are detailed in link

On Running the following Pre-requisites were mentioned. 
1. OpenCL Driver - Intel Driver Update Utility.
2. Visual studio 2017
3. Python 3.6 64 bit

Followed Steps provided on VS2017 Package and Details
conda install python=3.6.5

After installation goto the environment and run
1. Goto Folder and Run C:\Intel\computer_vision_sdk\bin\setupvars.bat


2. Run C:\Intel\computer_vision_sdk_2018.5.456\deployment_tools\model_optimizer\install_prerequisites


3. protobuf not installed error
4. pip install protobuf
5. Re-run - demo_squeezenet_download_convert_run.bat

6. Run Next Demo demo_security_barrier_camera.bat
7. Output is the image



After seeing the output, it seems my system and I are on the same page. The date is over!!!

Available Models - C:\Intel\computer_vision_sdk_2018.5.456\deployment_tools\computer_vision_algorithms\share\cva

Models located in C:\Intel\computer_vision_sdk_2018.5.456\deployment_tools\computer_vision_algorithms\share\cva\PersonReidentification\doc\examples

Outside the build file it builds executable and inference engine
C:\Users\#user#\Documents\Intel\OpenVINO\inference_engine_samples_2017\intel64\Release\classification_sample.exe -i C:\Intel\computer_vision_sdk_2018.5.456\deployment_tools\demo\\car.png -m "C:\Users\#user#\Documents\Intel\OpenVINO\openvino_models\ir\FP32\classification\squeezenet\1.1\caffe\squeezenet1.1.xml" -d CPU


There is a build file which needs to be understood and built to fix it. I was quickly trying to leverage the exe.

More Examples - https://software.intel.com/en-us/articles/OpenVINO-IE-Samples#multi-channel-face-detection-sample

Happy Mastering DL!!!

March 04, 2019

Spark Lessons #Applying Best Practices to Your Apache Spark Applications - Silvio Fiorito

Key Lessons
  • Spark is lazily executed
  • Apply transformations to query
  • Count, Write, For Each
  • Reader API - spark.read.load

  • Class - InMemoryFileIndex (Responsible for partition discovery) - S3 / HDFS

  • Anything over 32 folders kicks off a Spark job for partition discovery
  • Dealing with Many partitions
  • InMemoryFileIndex to index paths you are interested in
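The lazy execution noted at the top of this list can be illustrated with a small pure-Python model. This is not the Spark API: `LazyDataset` is a hypothetical class in which transformations only record a plan and actions (`count`, `collect`) trigger the work.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def count(self):  # action
        return sum(1 for _ in self._iterate())

    def collect(self):  # action
        return list(self._iterate())


calls = []


def double(x):
    calls.append(x)  # track when the function actually runs
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = plan.collect()           # the action triggers execution
print(calls_before_action, result)
```

Deferring the work lets an engine see the whole plan before running it, which is what makes optimizations such as partition pruning possible.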
Datasource Tables
  • Managed in Hive Metastore

  • External or unmanaged tables (Hive Schema over existing Dataset)
  • Managed Table (SparkSQL Manages)
  • Hive also keeps track of schema
  • Files / Tables diff


  • Tables - Schema in Metastore
  • For BI users you can use Tables
  • Dealing with CSV / JSON files
  • Scans the dataset and creates the schema (schema inference) - convenient, but slow on large datasets


Compression and Partition Scheme
  • Depends on Adhoc / Batch
  • Splittable compression schemes
  • Avoid Large Gzip text files


Optimization
  • Partitioning / Bucketing (persist hash partitioned data, good for joins and keys)
  • Each task will write a file in the bucket


  • Repartition by partition by value / column
  • One File per partition
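Why persisted hash partitioning helps joins can be shown with a toy bucketing scheme. This sketch uses `zlib.crc32` as a stable hash, which is an assumption for illustration and not the hash function Spark uses; the rows and bucket count are made up.

```python
import zlib
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Stable hash of the key, reduced to a bucket index."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key_field, num_buckets):
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_field], num_buckets)].append(row)
    return buckets


orders = [{"customer": c, "amount": a} for c, a in [("ann", 10), ("bob", 5), ("ann", 7)]]
customers = [{"customer": "ann", "city": "Pune"}, {"customer": "bob", "city": "Oslo"}]

order_buckets = bucketize(orders, "customer", num_buckets=4)
customer_buckets = bucketize(customers, "customer", num_buckets=4)

# Join bucket by bucket: matching keys are guaranteed to share a bucket index.
joined = []
for index, bucket_rows in order_buckets.items():
    lookup = {c["customer"]: c for c in customer_buckets.get(index, [])}
    for row in bucket_rows:
        joined.append({**row, "city": lookup[row["customer"]]["city"]})
print(joined)
```

Because both sides were bucketed on the join key with the same bucket count, each bucket can be joined on its own without moving rows between buckets.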

Query Optimization

  • SQL shuffle partitions (spark.sql.shuffle.partitions)
  • Override the default value based on data volume
  • Self-Union Reading dataset twice
  • Cost based optimizer

Happy Learning!!!