Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): November 2017

November 28, 2017

Day #92 - Mean Encoding

Mean Encoding

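Mean encoding adds new variables derived from existing categorical features
Label encoding is the usual way to handle categorical features
Mean encoding replaces each category with a statistic of the target computed over the rows in that category
The proportion of the positive label within each category is also computed in this step
Mean encoding compared with label encoding
Label encoding - Encoded values have no logical order
Mean encoding - Encoded values follow the target, so classes are more separable
We can reach a better loss with shorter trees
With label encoding, trees need a huge number of splits
The model tries to treat all categories differently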

Constructing Mean Encoding

Goods - Number of ones in a group
Bads - Number of zeros

Likelihood = Goods/(Goods + Bads) = mean(target)
Weight of Evidence = ln(Goods/Bads)*100
Count = Goods = sum(target)
Diff = Goods-Bads
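
The four statistics above can be sketched in plain Python. This is a minimal illustration, assuming a binary 0/1 target. The function name mean_encoding_stats and the sample data are hypothetical, and returning None for Weight of Evidence when a group has no ones or no zeros is a choice made for this sketch, not something from the notes.

```python
import math
from collections import defaultdict


def mean_encoding_stats(categories, targets):
    """Per-category Goods/Bads statistics for a binary (0/1) target."""
    goods = defaultdict(int)  # number of ones per category
    bads = defaultdict(int)   # number of zeros per category
    for cat, y in zip(categories, targets):
        if y == 1:
            goods[cat] += 1
        else:
            bads[cat] += 1

    stats = {}
    for cat in set(goods) | set(bads):
        g, b = goods[cat], bads[cat]
        stats[cat] = {
            "likelihood": g / (g + b),  # Goods / (Goods + Bads) = mean(target)
            # ln(Goods/Bads) * 100; undefined if either count is zero (assumption: use None)
            "woe": math.log(g / b) * 100 if g > 0 and b > 0 else None,
            "count": g,                 # Goods = sum(target)
            "diff": g - b,              # Goods - Bads
        }
    return stats


# Hypothetical toy data: a categorical feature and a binary target
cities = ["A", "A", "A", "B", "B", "B", "B"]
target = [1, 1, 0, 1, 0, 0, 0]
stats = mean_encoding_stats(cities, target)
print(stats["A"])
print(stats["B"])
```

Each row's category would then be replaced by one of these numbers (for example the likelihood) to form the new feature.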

Happy Learning!!!

November 24, 2017

Database Sharding and Scalability Basics

Some Key Considerations for NoSQL vs RDBMS

Performance - Latency tolerance, how slow queries can be on huge data sets
Durability - Data loss tolerance, how much in-memory data or how many transactions can be lost when the database crashes
Consistency - Weird results tolerance (Dirty data tolerance)
Availability - Downtime tolerance

Options for Scalability

Replication - Create copies of the database, the application can talk to any copy
Sharding - Choose a partition key and split the data across databases based on it, a key-value store partitions on the key
Caching - Results are precomputed and stored, manage cache expiration time and refresh logic
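
As a rough sketch of the sharding idea, the snippet below routes each key of a key-value store to one of several partitions by hashing the partition key. Everything here is an assumption for illustration: the names shard_for_key, put and get are hypothetical, the in-memory dicts stand in for separate databases, and MD5 modulo the shard count is just one simple, stable way to pick a partition.

```python
import hashlib


def shard_for_key(key, num_shards):
    """Map a partition key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each dict stands in for a separate database server (assumption for illustration)
shards = [dict() for _ in range(4)]


def put(key, value):
    shards[shard_for_key(key, len(shards))][key] = value


def get(key):
    return shards[shard_for_key(key, len(shards))].get(key)


put("user:42", {"name": "example"})
print(shard_for_key("user:42", len(shards)), get("user:42"))
```

A stable hash is used instead of Python's built-in hash() because the built-in one is randomized per process for strings, which would send the same key to different shards after a restart.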

For streaming data we had already discussed Event Hubs and Apache Kafka. Now we have something called KSQL (Kafka streaming SQL, for running SQL on continuous data)

Great Session Talk

RDBMS vs NoSQL Considerations, Quick Summary

Performance - Latency tolerance
Durability - Data loss tolerance
Consistency - Weird results tolerance (Dirty data tolerance)
Availability - Downtime tolerance

Happy Learning!!!

November 16, 2017

Day #91- Retail Analytics - Data Mining / Analytics

Running a successful #Retail Store involves many Data Mining / Analytics challenges that must be solved to arrive at decisions based on data. Some interesting Retail Data Mining / Analytics problems are

What sells best in each store, with item level details?
What are the shopping times / routines for a particular store?
Using web data, identify the relevance of the shopping district / retail environment
What are the money making items in the store (Quantity vs Price)?
What is the Sales / Stock ratio?
What is the forecast value of minimum orders for items in each store based on sales / traffic trends?
What is the correlation between loss items, shopping days / periods and people movements?
What are the retail price points identified based on End of Season Sales?

Forecasts / Predictions come as next steps after Data Analysis

Happy Analytics!!!

November 15, 2017

Day #90 - Regression Metrics Optimization

RMSE, MSE, R-Squared (Sometimes called L2 Loss)
Tree-Based

XGBoost, LightGBM
sklearn.ensemble.RandomForestRegressor

Linear Models

sklearn.<>Regression
sklearn.linear_model.SGDRegressor

Neural Networks

PyTorch
Keras

MAE (L1, Median Regression)
Tree-Based

LightGBM
sklearn.ensemble.RandomForestRegressor

MSPE, MAPE

MSPE is a weighted version of MSE
MAPE is a weighted version of MAE
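
The "weighted version" statement can be checked numerically. The sketch below is an illustration under stated assumptions: targets are non-zero, the sample weights are 1/y^2 for MSPE and 1/|y| for MAPE, both percentage metrics are scaled by 100, and all function names and toy numbers are hypothetical.

```python
def mspe(y_true, y_pred):
    """Mean squared percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(((y - p) / y) ** 2 for y, p in zip(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred))


def weighted_mse(y_true, y_pred, weights):
    n = len(y_true)
    return sum(w * (y - p) ** 2 for y, p, w in zip(y_true, y_pred, weights)) / n


def weighted_mae(y_true, y_pred, weights):
    n = len(y_true)
    return sum(w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights)) / n


y_true = [10.0, 100.0]
y_pred = [9.0, 90.0]  # both predictions are off by 10 percent

# MSPE equals MSE with sample weights 1 / y^2 (times 100)
w_mspe = [1.0 / y ** 2 for y in y_true]
print(mspe(y_true, y_pred), 100.0 * weighted_mse(y_true, y_pred, w_mspe))

# MAPE equals MAE with sample weights 1 / |y| (times 100)
w_mape = [1.0 / abs(y) for y in y_true]
print(mape(y_true, y_pred), 100.0 * weighted_mae(y_true, y_pred, w_mape))
```

This is why a model that only optimizes MSE or MAE can still be pointed at MSPE or MAPE by passing per-sample weights.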

Happy coding and learning!!!

November 14, 2017

Day #89 - Capsule networks

Key lessons

Instead of adding layers, a capsule network nests layers inside a layer
We apply non-linearity to grouped neurons (a capsule)
Dynamic routing - Replaces the scalar-output feature detectors of a CNN with vector-output capsules, and pooling with routing by agreement

CNN History

Latest paper on capsule networks
Offers state-of-the-art performance on the MNIST dataset
Convolutional networks - Learn a mapping between input data and output labels
Convolution layer - A series of matrix multiplication and summation operations, the output is a feature map (a bunch of learned features from the image)
ReLU - Applies non-linearity to the feature map (so the network can learn both linear and non-linear functions). Helps with the vanishing gradient problem (as the gradient is backpropagated it gets smaller and smaller, ReLU reduces this effect)
Pooling - Splits the feature map into sections and takes the maximum pixel value from each section
Each line of code corresponds to a layer in the network
Dropout - Neurons are randomly turned off during training to prevent overfitting (regularization technique)
For handling rotations - AlexNet added different rotations of the training images to generalize to different rotations
Deeper networks improved classification accuracy
VGGNet - Added more layers
GoogLeNet - Convolutions with different sizes are processed on the same input, and several of those blocks are put together
ResNet - Instead of only stacking layers, an add operation (skip connection) eased the vanishing gradient problem
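
The pooling step from the list above can be sketched without any framework. This is a minimal example, assuming non-overlapping 2x2 sections and a feature map given as a list of lists with even height and width. The function name max_pool_2x2 and the toy feature map are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Take the maximum value from each non-overlapping 2x2 section."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            section = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(section))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

The output keeps only the strongest response per section and discards where inside the section it occurred, which is the loss of spatial information that capsule networks try to avoid.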

Convolutional Network Challenges

As we go up the hierarchy, the features learnt become more complex
The hierarchy builds up with each layer
Sub-sampling loses spatial relationships
Spatial correlations are missed in sub-sampling and pooling
Bad for rotated images (Invariance issues)

Capsule Networks

Basic idea - The human brain attains translational invariance in a better way. Instead of adding layers, a capsule network nests layers inside a layer
The nested layer is called a capsule, a group of neurons
CNNs route by pooling
Deeper in terms of nesting

Layer based squashing

Based on the output neurons we apply non-linearity
We apply non-linearity to grouped neurons (a capsule)
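
A minimal sketch of a squashing non-linearity applied to a whole capsule (a vector) rather than to a single neuron. It assumes the squash function from the capsule networks paper, v = (|s|^2 / (1 + |s|^2)) * (s / |s|). The function name squash, the eps guard against division by zero, and the sample vectors are choices made for this illustration.

```python
import math


def squash(s, eps=1e-9):
    """Shrink a capsule's output vector to length below 1, keeping its direction."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # |s|^2 / (1 + |s|^2) sets the new length, s / |s| keeps the direction
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    return math.sqrt(sum(x * x for x in v))


long_capsule = squash([3.0, 4.0])    # long input vector
short_capsule = squash([0.1, 0.0])   # short input vector
print(length(long_capsule), length(short_capsule))
```

Long vectors end up with length close to 1 and short vectors close to 0, so the length of a capsule's output can be read as the probability that the entity it represents is present.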

Dynamic routing

Replace scalar output with routing by agreement
Hierarchy tree of nested layers

Key difference - The output is computed over several routing iterations, and the operations are applied for every nested capsule
Happy coding and learning!!!

Day #88 - Metrics Optimization

Loss vs Metric

Metric - Function which we want to use to evaluate the model, e.g. maximum accuracy in classification
Optimization Loss - Function our model optimizes, easy to optimize for the given model, e.g. MSE, LogLoss
Preprocess the train set and optimize another metric - MSPE, MAPE, RMSLE
Optimize another metric and postprocess predictions - Accuracy, Kappa
Early Stopping - Stop training when the model starts to overfit
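
Early stopping can be sketched as a loop over per-round validation scores of the metric we care about. This is only an illustration: the function name best_round_with_early_stopping, the patience parameter and the list of validation losses are hypothetical, and real libraries expose their own early stopping options.

```python
def best_round_with_early_stopping(val_losses, patience=2):
    """Return (best_round, best_loss), stopping once the validation loss
    has not improved for `patience` consecutive rounds."""
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # the model has started to overfit, stop training
    return best_round, best_loss


# Hypothetical validation losses, one per training round
val_losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]
result = best_round_with_early_stopping(val_losses, patience=2)
print(result)
```

In this toy run training stops after round 4, so the later value 0.50 is never seen. That is the trade-off that the patience setting controls.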

Custom loss functions

Accuracy Metrics

Happy Coding and Learning!!!

November 10, 2017

Day #87 - Classification Metrics

Accuracy (Essential for classification), Weighted Accuracy = Weighted Kappa
Logarithmic Loss (Depends on soft predictions probabilities)
Area Under the Receiver Operating Characteristic Curve (Considers ordering of objects, tries all thresholds to convert soft predictions to hard labels)
Kappa (Similar to R Squared)

Notations
N - Number of objects
L - Number of classes
y - Ground truth
y~ - Predictions
[a = b] - indicator function

Soft labels (soft predictions) are the classifier's scores - Probabilities of objects belonging to each class
Hard labels (hard predictions) are a function of the soft labels - argmax fi(x) for multiclass (pick the class with the maximum soft prediction), or [f(x) > b] for binary classification, where b is the threshold

Accuracy Score

Most referred measure of classifier quality
Higher is better
Needs hard predictions
Fraction of correctly guessed objects
Hard predictions are the argmax of soft predictions
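
A small sketch of going from soft predictions to hard labels and then to accuracy. The function names, the default threshold b=0.5 and the toy predictions are assumptions for illustration.

```python
def hard_label(soft_prediction):
    """Multiclass: index of the class with the maximum soft prediction (argmax)."""
    return max(range(len(soft_prediction)), key=lambda i: soft_prediction[i])


def hard_label_binary(score, b=0.5):
    """Binary: indicator [f(x) > b] for threshold b."""
    return int(score > b)


def accuracy(y_true, y_pred):
    """Fraction of objects whose hard prediction equals the ground truth."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


soft = [
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
y_true = [1, 0, 1]
y_hard = [hard_label(p) for p in soft]
print(y_hard, accuracy(y_true, y_hard))
```

Accuracy only looks at the hard labels, so it does not change when a soft prediction moves from 0.51 to 0.99 on the same side of the threshold.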

Logloss

Works with soft predictions
Makes the classifier output posterior probabilities
Penalises confident wrong answers heavily
Best constant prediction - Set the constant to the frequencies of each class
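
A minimal sketch of logloss for integer class labels, together with the class-frequency constant baseline. It assumes the usual form, the negative mean log of the probability assigned to the true class, and clips probabilities to avoid log(0). The names log_loss, class_frequencies and the eps value are choices made for this example.

```python
import math
from collections import Counter


def log_loss(y_true, probs, eps=1e-15):
    """Negative mean log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p_true)
    return -total / len(y_true)


def class_frequencies(y_true, n_classes):
    counts = Counter(y_true)
    return [counts[k] / len(y_true) for k in range(n_classes)]


y_true = [0, 0, 0, 1]
freqs = class_frequencies(y_true, n_classes=2)

loss_freq = log_loss(y_true, [freqs] * len(y_true))          # constant = class frequencies
loss_half = log_loss(y_true, [[0.5, 0.5]] * len(y_true))     # constant = uniform
loss_wrong = log_loss(y_true, [[0.01, 0.99]] * len(y_true))  # confident and mostly wrong
print(freqs, loss_freq, loss_half, loss_wrong)
```

The frequency constant beats the uniform constant, and confident wrong answers are punished far more than uncertain ones.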

Area Under Curve

For a given threshold, objects are split into those scored above and below the threshold
The metric tries all possible thresholds and aggregates the scores
Depends only on the order of objects

AUC - ROC

Compute TruePositive, FalsePositive
AUC max value 1
Fraction of correctly ordered pairs

AUC = Number of correctly ordered pairs / total number of pairs
= 1 - (Number of incorrectly ordered pairs / total number of pairs)
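
The pairs formulation can be computed directly by comparing every (positive, negative) pair of objects. This is a quadratic-time sketch for illustration only. The name auc_pairwise is hypothetical, and counting ties as half a correct pair is an assumption made here.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ordered correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    total_pairs = len(pos) * len(neg)
    if total_pairs == 0:
        raise ValueError("need at least one positive and one negative object")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive scored above negative: correctly ordered
            elif p == n:
                correct += 0.5   # tie counted as half (assumption)
    return correct / total_pairs


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_pairwise(y_true, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly
```

Only the ordering of the scores matters: rescaling all scores with any increasing function leaves the result unchanged.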

Cohen's Kappa

Score = 1- ((1-accuracy)/(1-baseline))
Baselines are different for each dataset
Similar to R squared
Here random predictions for the dataset are used as the baseline
Error = (1 - Accuracy)
Weighted Error Score = Multiply the confusion matrix by the weight matrix elementwise and sum the results
Weighted Kappa = 1 - ((weighted error)/(weighted baseline error))
Useful for medical applications
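
The formulas above can be sketched as follows. This is an illustration under stated assumptions: labels are integers 0..n_classes-1, the baseline error is the weighted error expected when predictions are randomly permuted (computed from the row and column totals of the confusion matrix), and plain Cohen's Kappa is obtained with a weight of 0 on the diagonal and 1 elsewhere. All names are hypothetical.

```python
def weighted_kappa(y_true, y_pred, n_classes, weights):
    """1 - (weighted error) / (weighted baseline error)."""
    n = len(y_true)
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # Elementwise product of confusion matrix and weight matrix, summed
    weighted_error = sum(
        confusion[i][j] * weights[i][j]
        for i in range(n_classes) for j in range(n_classes)
    ) / n
    # Same sum for the confusion matrix expected from randomly permuted predictions
    baseline_error = sum(
        true_counts[i] * pred_counts[j] * weights[i][j]
        for i in range(n_classes) for j in range(n_classes)
    ) / (n * n)
    return 1.0 - weighted_error / baseline_error


def cohen_kappa(y_true, y_pred, n_classes):
    """Plain kappa: every wrong answer has weight 1, correct answers weight 0."""
    weights = [[0 if i == j else 1 for j in range(n_classes)] for i in range(n_classes)]
    return weighted_kappa(y_true, y_pred, n_classes, weights)


y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
kappa = cohen_kappa(y_true, y_pred, n_classes=2)
print(kappa)  # accuracy 0.75, baseline accuracy 0.5
```

With ordered classes, such as disease stages, the weight matrix can grow with the distance between classes (for example |i - j|), so that confusing neighbouring stages costs less than confusing distant ones.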


Happy Learning and Coding!!!

November 09, 2017

Day #86 - Regression Metrics

Relative errors are most important to us
MSE, MAE work with absolute errors, not relative errors
MSPE (mean square percentage error) - Weighted version of MSE
MAPE (mean absolute percentage error) - Weighted version of MAE
RMSLE (root mean square logarithmic error) - RMSE calculated on a logarithmic scale - Cares about relative errors
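
A short sketch of RMSLE, assuming the common definition with log(1 + x) so that zero targets are allowed. The function name rmsle and the toy values are hypothetical.

```python
import math


def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value) instead of the raw values."""
    n = len(y_true)
    squared = sum((math.log1p(p) - math.log1p(y)) ** 2 for y, p in zip(y_true, y_pred))
    return math.sqrt(squared / n)


# Absolute errors of 90 and 900, but the same relative error on the log(1 + x) scale
small = rmsle([9.0], [99.0])     # log(100) - log(10)
large = rmsle([99.0], [999.0])   # log(1000) - log(100)
print(small, large)
```

Both cases score the same because the ratio (1 + prediction) / (1 + target) is the same, which is what "cares about relative errors" means here.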

Happy Coding and Learning!!!

November 07, 2017

Day #85 - Regression Metrics Optimization

Metrics

Metrics used to evaluate submissions
The best result comes from finding the optimal hyperplane for the chosen metric
Do exploratory metric analysis along with data analysis
Each metric has its own way of measuring the effectiveness of algorithms

Regression - Metrics

Mean Square Error
RMSE
R Squared
Same from optimization perspective

Classification

Accuracy
LogLoss
AUC
Cohen's Kappa

Regression Metrics
N - Samples
y - target values
y~ - target Predictions
yi - target ith value
yi~ - prediction ith object

Mean Square Error
MSE = (1/N) * sum over i of (yi - yi~)^2
- Average of the squared differences between targets and predictions

RMSE - Root Mean square Error = Sqrt(MSE)

RMSE is on the same scale as the target
RMSE vs MSE
Similar in terms of minimizers
Every MSE minimizer is an RMSE minimizer and vice versa
MSE(a) > MSE(b) <=> RMSE(a) > RMSE(b)
MSE orders models in the same way as RMSE
MSE is easier to work with
There is a bit of difference for gradient based models
They may not be interchangeable for learning methods (learning rate)
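
The MSE and RMSE definitions, and the fact that they order models the same way, can be sketched as below. Function names and toy predictions are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, on the same scale as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 7.0]
pred_a = [2.0, 5.0, 9.0]  # predictions of a hypothetical model a
pred_b = [3.0, 5.0, 8.0]  # predictions of a hypothetical model b

print(mse(y_true, pred_a), rmse(y_true, pred_a))
print(mse(y_true, pred_b), rmse(y_true, pred_b))

# Square root is monotonic, so both metrics rank the two models identically
same_order = (mse(y_true, pred_a) > mse(y_true, pred_b)) == (rmse(y_true, pred_a) > rmse(y_true, pred_b))
print(same_order)
```

The gradients differ by a factor that depends on the current RMSE value, which is why the same learning rate can behave differently for the two losses.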

R Squared

How much better the model is than a constant baseline
R Squared = 1 means the predictions are perfect
When MSE is 0, R Squared = 1
All reasonable models score between 0 and 1
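
A sketch of R Squared as one minus the ratio of the model's MSE to the MSE of the constant baseline that always predicts the target mean. The function name r_squared and the toy values are hypothetical, and the targets are assumed not to be all equal.

```python
def r_squared(y_true, y_pred):
    """1 - MSE(model) / MSE(constant mean baseline)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    model_mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    baseline_mse = sum((y - mean_y) ** 2 for y in y_true) / n
    return 1.0 - model_mse / baseline_mse


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, [2.5] * 4)              # always predict the mean
decent = r_squared(y_true, [1.5, 2.0, 3.0, 3.5])
print(perfect, baseline, decent)
```

A model worse than the constant baseline gets a negative score, which is why only reasonable models land between 0 and 1.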

MAE - Mean Absolute Error

Average of the absolute differences between targets and predictions
Widely used in Finance
A $10 error is exactly twice as bad as a $5 error
MAE is easier to justify
The best constant prediction for MAE is the median of the target values
The MAE gradient is a step function: -1 when the prediction is smaller than the target, +1 when it is greater than the target
MAE is not differentiable when the prediction equals the target
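
A small check of the median and gradient statements, using only the standard library. The function names and the toy targets (with one large outlier) are hypothetical.

```python
import statistics


def mae(y_true, y_pred):
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae_gradient(target, prediction):
    """Step function: -1 below the target, +1 above it (undefined at equality, 0 used here)."""
    if prediction < target:
        return -1
    if prediction > target:
        return 1
    return 0


y_true = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is an outlier
n = len(y_true)
median_const = [statistics.median(y_true)] * n
mean_const = [statistics.mean(y_true)] * n

print(mae(y_true, median_const), mae(y_true, mean_const))  # median wins on MAE
print(mse(y_true, median_const), mse(y_true, mean_const))  # mean wins on MSE
```

The median ignores how far away the outlier is, while the mean is pulled towards it. This is the sense in which MAE is more robust to outliers than MSE.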

MAE vs MSE

For outliers - use MAE
For unexpected but normal values - use MSE
MAE is robust to outliers

Happy Learning and Coding!!!

November 05, 2017

Day #84 - Data Leaks and Validations

Mimic the train / test split of the test data in your validation
Perform KFold validations
Choose the best parameters for models
Submission stage (Can't mimic the exact train / test split)
Calculate the mean and standard deviation of leaderboard scores
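
A framework-free sketch of K-fold splitting and of summarizing scores by mean and standard deviation. The function names are hypothetical, the folds are contiguous with no shuffling (an assumption of this sketch), and the three scores are made-up placeholder numbers, not real results.

```python
import statistics


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated on exactly once."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


def summarize_scores(scores):
    """Mean and sample standard deviation of a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)

# Placeholder scores, one per fold (made up for illustration)
mean_score, std_score = summarize_scores([0.80, 0.82, 0.84])
print(mean_score, std_score)
```

A large standard deviation across folds suggests that differences between parameter settings may just be noise.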

Data Leaks

Unexpected information in data that lets you make good predictions
Unusable in real world
Results of unintentional error

Time Series

Incorrect time splits still exist
Check public and private splits
Missing feature columns are data leaks

Unexpected Information

File creation dates can carry unexpected information
Resize features / change creation dates
IDs - It makes no sense to include them in the model

Happy Learning and Coding!!!
