"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time based splits
  • Set up validation to mimic the train / test split
  • Time based trends can differ significantly, so time based patterns are important
Different splitting strategies can differ significantly
  • In the generated features
  • In the way the model will rely on those features
  • In some kind of target leak
Split Categories
  • Random split (split randomly by rows; rows are independent of each other) - row wise
  • Devise special features for cases where rows are dependent
  • Timewise - rows before a particular date as training data, rows after the date as testing data. Useful for features based on the target
  • Moving window validation
  • By Id - (e.g. cluster pictures, group them and then find features per group)
  • Combined (e.g. split by date for each shop independently)
Summary
  • In most cases data is split by row number, time or id
  • The logic of feature generation depends on the data splitting strategy
  • Set up your validation to mimic the train / test split of the competition (a small sketch follows below)
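
A minimal pandas / scikit-learn sketch of a row-wise random split versus a time based split; the toy DataFrame, its 'date' / 'target' columns and the cutoff date are made-up placeholders, not data from the course.

import pandas as pd
from sklearn.model_selection import train_test_split

# toy data; replace with the real competition data
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=100, freq='D'),
                   'target': range(100)})

# Row-wise (random) split: assumes rows are independent of each other
train_rand, valid_rand = train_test_split(df, test_size=0.2, random_state=42)

# Time-based split: everything before the cutoff is train, everything after is validation
cutoff = pd.Timestamp('2017-03-01')
train_time = df[df['date'] < cutoff]
valid_time = df[df['date'] >= cutoff]
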
Happy Learning and Coding!!!

Day #82 - Validation and Overfitting


  • Train Data (Past), Unseen Test Data (Future)
  • Divide into three parts - Train (Past), Validation (Past), Test (Future)
  • Underfitting (High Error on Both Training and Validation)
  • Overfitting (Doesn't generalize to test data, Low Error on Train, High Error on Validation)
  • Ideal (lowest error on both training and validation data)
Validation Strategies
  • Hold Out (divide data into training / validation parts with no overlap between them) - use on shuffled data
  • K-Fold (repeated hold out, because every sample takes part in validation once) - good choice for a medium amount of data; K-1 folds for training, one fold for validation - use on shuffled data
  • Leave One Out: ngroups = len(train) - for too little data (special case of K-Fold with K = number of samples)
  • Stratification - Similar target distribution over different folds
Stratification is useful for
  • Small datasets (where purely random splits can easily skew the folds)
  • Unbalanced datasets
  • Multiclass classification
Stratification preserves the target distribution over different folds (a small sklearn sketch follows below)
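
A minimal scikit-learn sketch of these validation strategies; the toy X / y arrays are made-up stand-ins for real competition data.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

# toy data stand-ins; replace with the competition's features / target
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)             # repeated hold-out: K-1 folds train, 1 validates
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # keeps target distribution similar across folds
loo = LeaveOneOut()                                               # special case of K-Fold with K = len(train)

for train_idx, valid_idx in skf.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
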

Happy Coding and Learning!!!

October 30, 2017

Day #81 - Dataset Cleaning


Dataset cleaning
  • Constant features (Remove features whose value remains constant in both training and testing data. If a value is constant in training but changes in testing, it is better to remove that feature too; this can happen when only a fraction of the data is supplied)
  • Duplicated features (Completely identical columns only slow down training time - remove the duplicate columns)
  • Duplicated categorical features (Label-encode the categorical features first and then compare them) - a small pandas sketch follows below
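
A minimal pandas sketch of this cleaning step, assuming a toy DataFrame with one constant and one duplicated column; real competition data would be loaded from file instead.

import pandas as pd

# toy frame; 'f1' is constant, 'f3' duplicates 'f2'
train = pd.DataFrame({'f1': [1, 1, 1, 1],
                      'f2': [3, 1, 3, 1],
                      'f3': [3, 1, 3, 1],
                      'cat': ['a', 'b', 'a', 'b']})

# Encode categorical columns first so duplicated categorical features can be compared too
for c in train.select_dtypes(include='object').columns:
    train[c] = train[c].factorize()[0]

# Constant features: a single unique value in the training data
constant_cols = [c for c in train.columns if train[c].nunique(dropna=False) == 1]

# Duplicated features: completely identical columns (transpose, then use duplicated())
dup_cols = train.columns[train.T.duplicated()].tolist()

train = train.drop(columns=list(set(constant_cols + dup_cols)))
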
Other things to check
  • Duplicated rows (duplicated rows with different targets could be the result of a mistake; remove such duplicated rows to get a better score on the test set)
  • Check for common rows in the train and test sets (labels for such test rows can be set manually using the training set)
  • Check whether the dataset is shuffled (if it is, values should just oscillate around the mean)
EDA Checklist
  • Get Domain Knowledge
  • Check How data is generated
  • Explore individual feature
  • Explore pairs and groups
  • Clean features
Happy Learning and Coding!!!

October 29, 2017

Day #80 - Visualizations

EDA is an art, and visualizations are its tools. Use several different plots to confirm or reject a hypothesis.

Visualization Tools
  • Histograms (split values into bins and count how many points fall into each bin; vary the number of bins) - plt.hist(x)
  • XGBoost can benefit from missing values being made explicit
  • Plots - index versus value, plt.plot(x, '.'), check for randomness over indices
  • Statistics
Explore Feature Relations
  • Scatter plots (draw one feature vs another); check how the data is distributed between the train and test sets
  • Correlation plots (run K-means clustering and reorder the features) - show how similar features are
  • Plot (index vs feature statistics)
Feature Groups
  • Generate new features based on groups
Pairs
  • ScatterPlot, Scatter matrix
  • Correlation Plot (Corrplot)
Groups
  • Corrplot + Clustering
  • Plot (index vs feature statistics) - a small matplotlib sketch of these plots follows below
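
A minimal matplotlib sketch of the plots mentioned above; the toy DataFrame and its 'f1' / 'f2' columns are made-up placeholders.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy data; replace with real features
df = pd.DataFrame({'f1': np.random.randn(1000), 'f2': np.random.rand(1000)})

plt.hist(df['f1'], bins=50)          # histogram: vary the number of bins
plt.show()

plt.plot(df['f1'].values, '.')       # index vs value: check randomness over indices
plt.show()

plt.scatter(df['f1'], df['f2'])      # scatter plot of one feature vs another
plt.show()

plt.matshow(df.corr())               # correlation plot: how similar features are
plt.show()
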

More Read (Link)




Happy Learning and Coding!!!

Day #79 - Exploratory Data Analysis (EDA)

EDA
  • Looking at the data and understanding it
  • A complete understanding of the data is required to build accurate models
  • Generate hypotheses / apply intuition
  • Top solutions use advanced and aggressive modelling
  • Find insights and magic features; start with EDA before hardcore modelling
Visualization
  • Identify Patterns (Visualization to idea)
  • Use patterns to find better models (Idea to visualization, Hypothesis testing)
EDA Steps
  • Get domain knowledge (use Google, Wikipedia to understand the data)
  • Check that the data is intuitive (validate values against the acquired domain knowledge; correct errors manually, or mark incorrect rows with a flag so the model can leverage it)
  • Understand how the data was generated (were the test set and training set generated by the same algorithm? We need to know the underlying data generation process; compare training / test set plots)
Exploring Anonymized and Encrypted Data
Anonymized Data
  • Organizers replace the data with encoded / hashed values (this does not have to hurt the model though)
  • Column names carry no meaningful information
  • Find the unique values of a feature, sort them and look at the differences
  • The distance between two consecutive values, and its pattern, can hint at the original scaling
Explore Individual Features
  • Guess the meaning of the columns
  • Guess the types of the column (Categorical, Boolean, Numeric etc..)
Explore Feature Relations
  • Find relation between pairs
  • Find feature groups
Useful Python functions
  • df.dtypes
  • df.info()
  • x.value_counts()
  • x.isnull()
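
A quick sketch of these calls on a made-up Titanic-style DataFrame; the 'Age' and 'Embarked' columns are just placeholders.

import numpy as np
import pandas as pd

# toy frame; real data would be loaded with pd.read_csv
titanic = pd.DataFrame({'Age': [22, np.nan, 35], 'Embarked': ['S', 'C', 'S']})

print(titanic.dtypes)                      # type of each column
titanic.info()                             # non-null counts and memory usage
print(titanic['Embarked'].value_counts())  # frequency of each category
print(titanic['Age'].isnull().sum())       # number of missing values in a column
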
Happy Learning and Coding!!!

Day #78 - Image Processing - Kaggle Lessons

  • Use a model already trained on similar data
  • Train a network from scratch
  • Use a pretrained model and fine-tune it later
VGGNet16 Architecture
  • Replace the last layer with a new one of size 4 (the number of target classes)
  • Retrain the model
  • Benefit from a model trained on a similar dataset
Image Augmentation
  • Increases the number of training samples
  • Image rotations (a small Keras sketch follows below)
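
A minimal Keras (tensorflow.keras) sketch of both ideas, assuming 4 target classes, 224x224 RGB inputs and a hypothetical 'images/' directory; this is a generic transfer-learning sketch, not the exact course code.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze pretrained layers, fine-tune only the new head

x = GlobalAveragePooling2D()(base.output)
out = Dense(4, activation='softmax')(x)     # new last layer sized to the 4 target classes
model = Model(base.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Image augmentation: rotations and flips increase the number of training samples
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True, rescale=1.0 / 255)
# model.fit(datagen.flow_from_directory('images/', target_size=(224, 224)), epochs=5)
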
Happy Learning and Coding!!!

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

  • For features - one-hot encoding, label encoding, frequency encoding, ranking, MinMaxScaler, StandardScaler
  • For dates - periodicity (year, date, week), time slice (time passed since a particular moment, before / after), difference between dates (datetime_feature1 - datetime_feature2), boolean flag indicating whether the date is a holiday
  • For text - preprocessing (lowercase, stemming, lemmatization, stopword removal), N-grams can help to use local context, postprocessing - TF-iDF, use BOW for N-grams
Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create a new column for each unique word in the data
  • Count occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • Rows become more comparable by using Term Frequency
  • tf = 1 / x.sum(axis=1)[:, None]
  • x = x * tf
  • Inverse Document Frequency
  • idf = np.log(x.shape[0] / (x > 0).sum(0))
  • x = x * idf
  • N-grams
  • Bag of Words (each row represents a text, each column represents a unique word)
  • Classifying documents

For N = 1, "This is a sentence":
Unigrams are - This, is, a, sentence

For N = 2, "This is a sentence":
Bigrams are - This is, is a, a sentence

For N = 3, "This is a sentence":
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer (a small sketch follows below)
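
A minimal scikit-learn sketch of bag of words and TF-iDF over N-grams; the toy documents are made up, and get_feature_names_out assumes a recent scikit-learn version.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['This is a sentence', 'This is another sentence']  # toy documents

# Bag of words over unigrams and bigrams
bow = CountVectorizer(ngram_range=(1, 2), analyzer='word')
X_bow = bow.fit_transform(docs)

# TF-iDF post-processing in one step
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
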

Text Preprocessing steps
  • Lower case
  • Lemmatization (using knowledge of vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (chops off the endings of words)
  • democracy, democratic and democratization -> democr (Stemming)
  • Stop words (words that do not contain important information)
sklearn.feature_extraction.text.CountVectorizer: the max_df and stop_words parameters help remove stop words

I have done all of this in my assignment work; the code is in my GitHub repository.

For Applying Bag of words
  • Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
  • N-grams can help to use local context
  • Postprocessing - TFiDF
  • Use BOW for Ngrams
BOW example
  • Sentence - "The dog is on the table"
  • Vocabulary         - are, cat, dog, is, now, on, the, table
  • BOW representation -   0,   0,   1,  1,   0,  1,   1,     1
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both sentences get the same BOW representation, however their meanings are opposite :)

Word to Vectors
  • Get a vector representation of words and texts
  • Each word is converted to a vector
  • Uses nearby words (context) during training
  • Different words used in the same context end up with similar vector representations
  • Basic arithmetic operations can be applied to the vectors
  • Words - Word2Vec, GloVe, FastText
  • Sentences - Doc2Vec
  • There are pretrained models available
Bag of Words
  • Very large vectors
  • Meaning of each value in vector is unknown
Word2Vec
  • Relatively small vectors
  • Values of vector can be interpreted only in some cases
  • The words with similar meaning often have similar embeddings (a small gensim sketch follows below)
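
A minimal gensim sketch of Word2Vec (parameter names assume gensim 4.x); the tiny tokenized corpus is made up and far too small for meaningful embeddings.

from gensim.models import Word2Vec

sentences = [['the', 'dog', 'is', 'on', 'the', 'table'],
             ['the', 'cat', 'is', 'on', 'the', 'mat']]   # toy tokenized corpus

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)  # each word -> 50-dim vector
vec = model.wv['dog']                       # relatively small, dense vector
similar = model.wv.most_similar('dog')      # words used in similar contexts get similar embeddings
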
Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

  • Reasons for Missing Values
  • How to Engineer them effectively
  • Hidden missing values
  • Plot the distribution of values; hidden missing values often show up as a spike on the histogram
Filling missing Values
  • -999, -1 (fill with a value outside the feature range) - gives trees a separate category, but the performance of linear models and neural networks can suffer
  • Mean, median
  • Reconstruct the value
  • Add an isnull indicator column (a small pandas sketch follows below)
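
A minimal pandas sketch of these filling options; the toy 'Age' column is a made-up placeholder.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 35, np.nan, 58]})

df['Age_isnull'] = df['Age'].isnull()                      # add isnull indicator column
df['Age_fill_999'] = df['Age'].fillna(-999)                # out-of-range constant: separate "category" for trees
df['Age_fill_mean'] = df['Age'].fillna(df['Age'].mean())   # mean/median fill; mean() ignores NaNs by default
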
Reconstruction
  • Missing values in a time series
  • Example: temperature values missing for some days of the month
  • Reconstruct them based on the increase / decrease pattern of neighbouring days
  • Ignore missing values while calculating the mean
  • Change categories to frequencies
  • XGBoost can handle NaN values natively
Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime
  • Datetime features differ significantly from numeric and categorical features
  • Periodicity - year, date, week
  • Time slice - time passed since a particular moment (before / after), time moments within a period
  • Difference between dates (datetime_feature1 - datetime_feature2)
  • Special time periods (e.g. medication every 3 days)
  • Sales predictions (days since the last holiday, days since the weekend, days since the last sales campaign)
  • Boolean flag indicating whether the date is a holiday or not
  • Churn prediction in a sales context
  •     (Date since user registration) - date diff
  •     (Date since last purchase) - date diff
  •     (Date since calling customer service) - date diff
  • Periodicity - day number in week, month, season, year; second, minute, hour
  • Time slice, difference between dates (a small pandas sketch follows below)
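
A minimal pandas sketch of these datetime features; the 'purchase_date' / 'registration_date' columns and the holiday list are made-up placeholders.

import pandas as pd

df = pd.DataFrame({'purchase_date': pd.to_datetime(['2017-10-27', '2017-12-25']),
                   'registration_date': pd.to_datetime(['2017-01-01', '2017-06-15'])})

# Periodicity
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['month'] = df['purchase_date'].dt.month

# Difference between dates (days since registration)
df['days_since_registration'] = (df['purchase_date'] - df['registration_date']).dt.days

# Boolean flag: is the date a holiday (hypothetical holiday list)
holidays = pd.to_datetime(['2017-12-25'])
df['is_holiday'] = df['purchase_date'].isin(holidays)
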
Coordinates
  • Can be used for churn prediction (likelihood the customer will return)
  • In a real estate scenario for predicting prices
  •     (Distance from a school)
  •     (Distance from an airport)
  •     (Number of flats around a particular point)
  • Alternatively, distance from the most expensive flat
  • Find cluster centres and compute distances from the centre points
  • Aggregated statistics for the surrounding data (a small sketch follows below)
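
A minimal numpy / scikit-learn sketch of coordinate features; the coordinates, the reference point and the number of clusters are made-up placeholders.

import numpy as np
from sklearn.cluster import KMeans

# toy (n_samples, 2) array of latitude / longitude pairs
coords = np.random.rand(100, 2) * 0.1 + [12.97, 77.59]

# Distance from a particular point (e.g. a school or airport; coordinates are made up)
point = np.array([12.98, 77.60])
dist_to_point = np.sqrt(((coords - point) ** 2).sum(axis=1))

# Distance to the assigned cluster centre as a "surroundings" feature
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(coords)
dist_to_centre = np.sqrt(((coords - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1))
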
Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - Based on Sort Order, Order of Appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum is zero, maximum is 1)
  • Harder for tree methods to use one-hot encoded features efficiently
  • Store only the non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate the strings from both columns
  • One-hot encode the result; a linear model can then find an optimal coefficient for every interaction (example below)
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
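
A minimal pandas sketch of the interaction shown in the table above; pandas.get_dummies does the one-hot encoding.

import pandas as pd

df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate strings from both columns to form the interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the interaction; a linear model can then fit a coefficient per combination
dummies = pd.get_dummies(df['pclass_sex'], prefix='pclass_sex')
df = pd.concat([df, dummies], axis=1)
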

Ordinal Features
  • An ordered categorical feature
  • First class is the most expensive, second less, third the least expensive
  • Driver's licence types A, B, C, D
  • Level of education (sorted in increasingly complex order)
  • Label encoding maps the categories to numbers (works for tree based models)
  • Non-tree models can't use label-encoded ordinal features effectively
Label Encoding
1. Alphabetically sorted: [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of appearance: [S,C,Q] -> [1,2,3]
 - pandas.factorize

Frequency Encoding (depending on the percentage of occurrences)
[S,C,Q] -> [0.5, 0.3, 0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding helps linear models (if the frequency is correlated with the target value, the linear model will use that dependency). It also preserves information about the value distribution.
  • If several categories have the same frequency, apply a rank transformation to break the ties
  • from scipy.stats import rankdata
Summary
  • Ordinal is a special case of a categorical feature
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for tree based models
  • One-hot encoding is used for non-tree based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!

Day #72- Feature Generation - Numeric Features

Feature Generation
  • Example: predict apple sales that follow a linear trend
  • Add a feature indicating the week number; a GBDT will then use the value calculated for each week
  • The generated feature changes the tree that gets built
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Apply Regularization in equal amounts
  • Do proper scaling
Min Max Scaler
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
Preprocessing (scaling) should be applied to all features, not just a few, so that each feature's initial impact on the model is roughly similar.
Preprocessing Outliers
  • Clip values to lower and upper bounds (e.g. the 1st and 99th percentiles)
  • Rank transformation
  • If outliers are present, ranking can be a better option than MinMax scaling
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation - np.log(1 + x)
  • Raising to a power < 1 - np.sqrt(x + 2/3) (a small sketch of these transforms follows below)
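
A minimal sketch of these scalers and transformations; the toy X matrix is a made-up placeholder.

import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.exponential(size=(100, 3))      # toy numeric feature matrix

X_minmax = MinMaxScaler().fit_transform(X)    # to [0, 1]
X_std = StandardScaler().fit_transform(X)     # to mean 0, std 1

X_rank = np.apply_along_axis(rankdata, 0, X)  # rank transform, less sensitive to outliers
X_log = np.log1p(X)                           # log(1 + x)
X_pow = np.sqrt(X + 2 / 3)                    # raising to a power < 1
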
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example: add price per square foot if the price and the size of the plot are provided
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!

Day #71 - Kaggle Best Practices


After a long pause, back to learning mode. This post is on learnings from the Coursera course - Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics on Kaggle Competition
  • Data - text, pictures (format could be csv, database, text file, speech etc.), accompanied by a description of the features
  • Model - exactly what is built during the competition; it transforms data into answers. Model properties - it should produce the best possible predictions and be reproducible
  • Submissions - your predictions are compared against those submitted by others
  • Evaluation - how good is your model; the quality of the model is defined by an evaluation function (e.g. the rate of correct answers)
  • Evaluation criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE

Guidelines for submissions
  • Analyze data
  • Fit model
  • Submit
  • See Public Score
Data Science Competition Platforms
  • kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • DataScience.net
  • KDD
  • ViZDoom
Session #2
Using Kaggle
  • Data format and explanations
  • Evaluation Criteria
  • Sample Submission File
  • Timelines page
Kaggle vs Real-World Competitions

Real-world machine learning problems have several stages
  • Understand business problem
  • Problem Formulation
  • Collect Data, Mine Examples
  • Clean Data
  • Preprocess it
  • Identify Model for Task
  • Select best models
  • Accuracy
  • Deploy model (make it available to users)
  • Monitor and retrain with new data

Kaggle
  • All data is already collected and the problem is already formalized
  • You focus on model creation and evaluation
Summary
  • Real world problems are complicated
  • Competitions are a great way to learn
  • But Kaggle competitions don't address the questions of formalization, deployment and testing
Key insights
  • Importance of understanding the data
  • Tools to use
  • It is sometimes worth trying complex solutions, advanced feature engineering and huge calculations
Session #3
  • Linear models (classify two sets of points with a straight line in 2 dimensions) - Logistic Regression, SVM (linear models with different loss functions); good for sparse, high-dimensional data; linear models split the space into two subspaces
  • Tree based - use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees); DT uses a divide-and-conquer approach to recursively split the space into sub-spaces. Tree based methods split the space into boxes
  • KNN - K Nearest Neighbours; points close to each other are likely to have similar labels, so find the K nearest objects and label by majority vote; relies heavily on how the distance between points is measured
  • Neural Networks - a special class of ML models; a black box that produces smooth separating curves; play with the parameters of simple feed-forward networks. Good for images, sound, text, speech
Session #4 - Data Preprocessing
  • Preprocess for feature engineering
  • Basic feature generation for different types of features
  • Numeric, Categorical, DateTime based features
Features
  • (0/1) - Binary Features
  • Numeric features (Age, fare)
  • Categorical (Classes)
Feature Preprocessing
  • Each feature has its own ways to be preprocessed
  • Preprocessing depends on the model you plan to use
  • Linear models do not work well with raw class-label features
  • Use a one-hot encoder for them
  • Random forest can easily treat each class separately and predict each probability

Data Types
Structured Data
  • Ordinal - ranks such as 1st / 2nd / 3rd
  • Numerical - specific numeric data
  • Continuous - e.g. petrol prices
  • Categorical - e.g. days of the week, months of the year

Happy Learning and Coding!!!