"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2017

Day #83 - Data Splitting Strategies

  • Time based splits
  • Set up validation to mimic the train / test split
  • Time based trends can differ significantly, so time based patterns are important
Different splitting strategies can differ significantly
  • In the generated features
  • In the way the model will rely on those features
  • In some kind of target leak
Split Categories
  • Random split (split randomly by rows; rows are independent of each other) - row wise
  • Devise special features for cases where rows are dependent
  • Timewise - rows before a particular date as training data, rows after the date as testing data. Useful for features based on the target
  • Moving window validation
  • By Id - (e.g. cluster pictures, group them and then find features per group)
  • Combined (e.g. split by date for each shop independently)
Summary
  • In most cases data is split by row number, time or id
  • The logic of feature generation depends on the data splitting strategy
  • Set up your validation to mimic the train / test split of the competition (a small sketch follows below)
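
A minimal pandas / scikit-learn sketch of a row-wise random split versus a time based split; the toy DataFrame, its 'date' / 'target' columns and the cutoff date are made-up placeholders, not data from the course.

import pandas as pd
from sklearn.model_selection import train_test_split

# toy data; replace with the real competition data
df = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=100, freq='D'),
                   'target': range(100)})

# Row-wise (random) split: assumes rows are independent of each other
train_rand, valid_rand = train_test_split(df, test_size=0.2, random_state=42)

# Time-based split: everything before the cutoff is train, everything after is validation
cutoff = pd.Timestamp('2017-03-01')
train_time = df[df['date'] < cutoff]
valid_time = df[df['date'] >= cutoff]
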
Happy Learning and Coding!!!

Day #82 - Validation and Overfitting


  • Train Data (Past), Unseen Test Data (Future)
  • Divide into three parts - Train (Past), Validation (Past), Test (Future)
  • Underfitting (High Error on Both Training and Validation)
  • Overfitting (Doesn't generalize to test data, Low Error on Train, High Error on Validation)
  • Ideal (lowest error on both training and validation data)
Validation Strategies
  • Hold Out (divide data into training / validation parts with no overlap between them) - use on shuffled data
  • K-Fold (repeated hold out, because every sample takes part in validation once) - good choice for a medium amount of data; K-1 folds for training, one fold for validation - use on shuffled data
  • Leave One Out: ngroups = len(train) - for too little data (special case of K-Fold with K = number of samples)
  • Stratification - Similar target distribution over different folds
Stratification is useful for
  • Small datasets (where purely random splits can easily skew the folds)
  • Unbalanced datasets
  • Multiclass classification
Stratification preserves the target distribution over different folds (a small sklearn sketch follows below)
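
A minimal scikit-learn sketch of these validation strategies; the toy X / y arrays are made-up stand-ins for real competition data.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

# toy data stand-ins; replace with the competition's features / target
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)             # repeated hold-out: K-1 folds train, 1 validates
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # keeps target distribution similar across folds
loo = LeaveOneOut()                                               # special case of K-Fold with K = len(train)

for train_idx, valid_idx in skf.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
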

Happy Coding and Learning!!!

October 30, 2017

Day #81 - Dataset Cleaning


Dataset cleaning
  • Constant features (Remove features whose value remains constant in both training and testing data. If a value is constant in training but changes in testing, it is better to remove that feature too; this can happen when only a fraction of the data is supplied)
  • Duplicated features (Completely identical columns only slow down training time - remove the duplicate columns)
  • Duplicated categorical features (Label-encode the categorical features first and then compare them) - a small pandas sketch follows below
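
A minimal pandas sketch of this cleaning step, assuming a toy DataFrame with one constant and one duplicated column; real competition data would be loaded from file instead.

import pandas as pd

# toy frame; 'f1' is constant, 'f3' duplicates 'f2'
train = pd.DataFrame({'f1': [1, 1, 1, 1],
                      'f2': [3, 1, 3, 1],
                      'f3': [3, 1, 3, 1],
                      'cat': ['a', 'b', 'a', 'b']})

# Encode categorical columns first so duplicated categorical features can be compared too
for c in train.select_dtypes(include='object').columns:
    train[c] = train[c].factorize()[0]

# Constant features: a single unique value in the training data
constant_cols = [c for c in train.columns if train[c].nunique(dropna=False) == 1]

# Duplicated features: completely identical columns (transpose, then use duplicated())
dup_cols = train.columns[train.T.duplicated()].tolist()

train = train.drop(columns=list(set(constant_cols + dup_cols)))
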
Other things to check
  • Duplicated rows (duplicated rows with different targets could be the result of a mistake; remove such duplicated rows to get a better score on the test set)
  • Check for common rows in the train and test sets (labels for such test rows can be set manually using the training set)
  • Check whether the dataset is shuffled (if it is, values should just oscillate around the mean)
EDA Checklist
  • Get Domain Knowledge
  • Check How data is generated
  • Explore individual feature
  • Explore pairs and groups
  • Clean features
Happy Learning and Coding!!!

October 29, 2017

Day #80 - Visualizations

EDA is an art, and visualizations are its tools. Use several different plots to confirm or reject a hypothesis.

Visualization Tools
  • Histograms (split values into bins and count how many points fall into each bin; vary the number of bins) - plt.hist(x)
  • XGBoost can benefit from missing values being made explicit
  • Plots - index versus value, plt.plot(x, '.'), check for randomness over indices
  • Statistics
Explore Feature Relations
  • Scatter plots (draw one feature vs another); check how the data is distributed between the train and test sets
  • Correlation plots (run K-means clustering and reorder the features) - show how similar features are
  • Plot (index vs feature statistics)
Feature Groups
  • Generate new features based on groups
Pairs
  • ScatterPlot, Scatter matrix
  • Correlation Plot (Corrplot)
Groups
  • Corrplot + Clustering
  • Plot (index vs feature statistics) - a small matplotlib sketch of these plots follows below
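
A minimal matplotlib sketch of the plots mentioned above; the toy DataFrame and its 'f1' / 'f2' columns are made-up placeholders.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# toy data; replace with real features
df = pd.DataFrame({'f1': np.random.randn(1000), 'f2': np.random.rand(1000)})

plt.hist(df['f1'], bins=50)          # histogram: vary the number of bins
plt.show()

plt.plot(df['f1'].values, '.')       # index vs value: check randomness over indices
plt.show()

plt.scatter(df['f1'], df['f2'])      # scatter plot of one feature vs another
plt.show()

plt.matshow(df.corr())               # correlation plot: how similar features are
plt.show()
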

More Read (Link)




Happy Learning and Coding!!!

Day #79 - Exploratory Data Analysis (EDA)

EDA
  • Looking at the data and understanding it
  • A complete understanding of the data is required to build accurate models
  • Generate hypotheses / apply intuition
  • Top solutions use advanced and aggressive modelling
  • Find insights and magic features; start with EDA before hardcore modelling
Visualization
  • Identify Patterns (Visualization to idea)
  • Use patterns to find better models (Idea to visualization, Hypothesis testing)
EDA Steps
  • Get domain knowledge (use Google, Wikipedia to understand the data)
  • Check that the data is intuitive (validate values against the acquired domain knowledge; correct errors manually, or mark incorrect rows with a flag so the model can leverage it)
  • Understand how the data was generated (were the test set and training set generated by the same algorithm? We need to know the underlying data generation process; compare training / test set plots)
Exploring Anonymized and Encrypted Data
Anonymized Data
  • Organizers replace the data with encoded / hashed values (this does not have to hurt the model though)
  • Column names carry no meaningful information
  • Find the unique values of a feature, sort them and look at the differences
  • The distance between two consecutive values, and its pattern, can hint at the original scaling
Explore Individual Features
  • Guess the meaning of the columns
  • Guess the types of the column (Categorical, Boolean, Numeric etc..)
Explore Feature Relations
  • Find relation between pairs
  • Find feature groups
Useful Python functions
  • df.dtypes
  • df.info()
  • x.value_counts()
  • x.isnull()
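
A quick sketch of these calls on a made-up Titanic-style DataFrame; the 'Age' and 'Embarked' columns are just placeholders.

import numpy as np
import pandas as pd

# toy frame; real data would be loaded with pd.read_csv
titanic = pd.DataFrame({'Age': [22, np.nan, 35], 'Embarked': ['S', 'C', 'S']})

print(titanic.dtypes)                      # type of each column
titanic.info()                             # non-null counts and memory usage
print(titanic['Embarked'].value_counts())  # frequency of each category
print(titanic['Age'].isnull().sum())       # number of missing values in a column
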
Happy Learning and Coding!!!

Day #78 - Image Processing - Kaggle Lessons

  • Use a model already trained on similar data
  • Train a network from scratch
  • Use a pretrained model and fine-tune it later
VGGNet16 Architecture
  • Replace the last layer with a new one of size 4 (the number of target classes)
  • Retrain the model
  • Benefit from a model trained on a similar dataset
Image Augmentation
  • Increases the number of training samples
  • Image rotations (a small Keras sketch follows below)
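
A minimal Keras (tensorflow.keras) sketch of both ideas, assuming 4 target classes, 224x224 RGB inputs and a hypothetical 'images/' directory; this is a generic transfer-learning sketch, not the exact course code.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze pretrained layers, fine-tune only the new head

x = GlobalAveragePooling2D()(base.output)
out = Dense(4, activation='softmax')(x)     # new last layer sized to the 4 target classes
model = Model(base.input, out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Image augmentation: rotations and flips increase the number of training samples
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True, rescale=1.0 / 255)
# model.fit(datagen.flow_from_directory('images/', target_size=(224, 224)), epochs=5)
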
Happy Learning and Coding!!!

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

  • For features - one-hot encoding, label encoding, frequency encoding, ranking, MinMaxScaler, StandardScaler
  • For dates - periodicity (year, date, week), time slice (time passed since a particular moment, before / after), difference between dates (datetime_feature1 - datetime_feature2), boolean flag indicating whether the date is a holiday
  • For text - preprocessing (lowercase, stemming, lemmatization, stopword removal), N-grams can help to use local context, postprocessing - TF-iDF, use BOW for N-grams
Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create a new column for each unique word in the data
  • Count occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • Rows become more comparable by using Term Frequency
  • tf = 1 / x.sum(axis=1)[:, None]
  • x = x * tf
  • Inverse Document Frequency
  • idf = np.log(x.shape[0] / (x > 0).sum(0))
  • x = x * idf
  • N-grams
  • Bag of Words (each row represents a text, each column represents a unique word)
  • Classifying documents

For N = 1, "This is a sentence":
Unigrams are - This, is, a, sentence

For N = 2, "This is a sentence":
Bigrams are - This is, is a, a sentence

For N = 3, "This is a sentence":
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer (a small sketch follows below)
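
A minimal scikit-learn sketch of bag of words and TF-iDF over N-grams; the toy documents are made up, and get_feature_names_out assumes a recent scikit-learn version.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['This is a sentence', 'This is another sentence']  # toy documents

# Bag of words over unigrams and bigrams
bow = CountVectorizer(ngram_range=(1, 2), analyzer='word')
X_bow = bow.fit_transform(docs)

# TF-iDF post-processing in one step
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
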

Text Preprocessing steps
  • Lower case
  • Lemmatization (using knowledge of vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (chops off the endings of words)
  • democracy, democratic and democratization -> democr (Stemming)
  • Stop words (words that do not contain important information)
sklearn.feature_extraction.text.CountVectorizer: the max_df and stop_words parameters help remove stop words

I have done all of this in my assignment work; the code is in my GitHub repository.

For Applying Bag of words
  • Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
  • N-grams can help to use local context
  • Postprocessing - TFiDF
  • Use BOW for Ngrams
BOW example
  • Sentence - "The dog is on the table"
  • Vocabulary         - are, cat, dog, is, now, on, the, table
  • BOW representation -   0,   0,   1,  1,   0,  1,   1,     1
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both sentences get the same BOW representation, however their meanings are opposite :)

Word to Vectors
  • Get a vector representation of words and texts
  • Each word is converted to a vector
  • Uses nearby words (context) during training
  • Different words used in the same context end up with similar vector representations
  • Basic arithmetic operations can be applied to the vectors
  • Words - Word2Vec, GloVe, FastText
  • Sentences - Doc2Vec
  • There are pretrained models available
Bag of Words
  • Very large vectors
  • Meaning of each value in vector is unknown
Word2Vec
  • Relatively small vectors
  • Values of vector can be interpreted only in some cases
  • The words with similar meaning often have similar embeddings (a small gensim sketch follows below)
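
A minimal gensim sketch of Word2Vec (parameter names assume gensim 4.x); the tiny tokenized corpus is made up and far too small for meaningful embeddings.

from gensim.models import Word2Vec

sentences = [['the', 'dog', 'is', 'on', 'the', 'table'],
             ['the', 'cat', 'is', 'on', 'the', 'mat']]   # toy tokenized corpus

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)  # each word -> 50-dim vector
vec = model.wv['dog']                       # relatively small, dense vector
similar = model.wv.most_similar('dog')      # words used in similar contexts get similar embeddings
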
Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

  • Reasons for Missing Values
  • How to Engineer them effectively
  • Hidden missing values
  • Plot the distribution of values; hidden missing values often show up as a spike on the histogram
Filling missing Values
  • -999, -1 (fill with a value outside the feature range) - gives trees a separate category, but the performance of linear models and neural networks can suffer
  • Mean, median
  • Reconstruct the value
  • Add an isnull indicator column (a small pandas sketch follows below)
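
A minimal pandas sketch of these filling options; the toy 'Age' column is a made-up placeholder.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 35, np.nan, 58]})

df['Age_isnull'] = df['Age'].isnull()                      # add isnull indicator column
df['Age_fill_999'] = df['Age'].fillna(-999)                # out-of-range constant: separate "category" for trees
df['Age_fill_mean'] = df['Age'].fillna(df['Age'].mean())   # mean/median fill; mean() ignores NaNs by default
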
Reconstruction
  • Missing values in a time series
  • Example: temperature values missing for some days of the month
  • Reconstruct them based on the increase / decrease pattern of neighbouring days
  • Ignore missing values while calculating the mean
  • Change categories to frequencies
  • XGBoost can handle NaN values natively
Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime
  • Datetime features differ significantly from numeric and categorical features
  • Periodicity - year, date, week
  • Time slice - time passed since a particular moment (before / after), time moments within a period
  • Difference between dates (datetime_feature1 - datetime_feature2)
  • Special time periods (e.g. medication every 3 days)
  • Sales predictions (days since the last holiday, days since the weekend, days since the last sales campaign)
  • Boolean flag indicating whether the date is a holiday or not
  • Churn prediction in a sales context
  •     (Date since user registration) - date diff
  •     (Date since last purchase) - date diff
  •     (Date since calling customer service) - date diff
  • Periodicity - day number in week, month, season, year; second, minute, hour
  • Time slice, difference between dates (a small pandas sketch follows below)
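
A minimal pandas sketch of these datetime features; the 'purchase_date' / 'registration_date' columns and the holiday list are made-up placeholders.

import pandas as pd

df = pd.DataFrame({'purchase_date': pd.to_datetime(['2017-10-27', '2017-12-25']),
                   'registration_date': pd.to_datetime(['2017-01-01', '2017-06-15'])})

# Periodicity
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['month'] = df['purchase_date'].dt.month

# Difference between dates (days since registration)
df['days_since_registration'] = (df['purchase_date'] - df['registration_date']).dt.days

# Boolean flag: is the date a holiday (hypothetical holiday list)
holidays = pd.to_datetime(['2017-12-25'])
df['is_holiday'] = df['purchase_date'].isin(holidays)
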
Coordinates
  • Can be used for churn prediction (likelihood the customer will return)
  • In a real estate scenario for predicting prices
  •     (Distance from a school)
  •     (Distance from an airport)
  •     (Number of flats around a particular point)
  • Alternatively, distance from the most expensive flat
  • Find cluster centres and compute distances from the centre points
  • Aggregated statistics for the surrounding data (a small sketch follows below)
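
A minimal numpy / scikit-learn sketch of coordinate features; the coordinates, the reference point and the number of clusters are made-up placeholders.

import numpy as np
from sklearn.cluster import KMeans

# toy (n_samples, 2) array of latitude / longitude pairs
coords = np.random.rand(100, 2) * 0.1 + [12.97, 77.59]

# Distance from a particular point (e.g. a school or airport; coordinates are made up)
point = np.array([12.98, 77.60])
dist_to_point = np.sqrt(((coords - point) ** 2).sum(axis=1))

# Distance to the assigned cluster centre as a "surroundings" feature
km = KMeans(n_clusters=5, random_state=42, n_init=10).fit(coords)
dist_to_centre = np.sqrt(((coords - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1))
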
Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - Based on Sort Order, Order of Appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum is zero, maximum is 1)
  • Harder for tree methods to use one-hot encoded features efficiently
  • Store only the non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate the strings from both columns
  • One-hot encode the result; a linear model can then find an optimal coefficient for every interaction (example below)
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
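
A minimal pandas sketch of the interaction shown in the table above; pandas.get_dummies does the one-hot encoding.

import pandas as pd

df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate strings from both columns to form the interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the interaction; a linear model can then fit a coefficient per combination
dummies = pd.get_dummies(df['pclass_sex'], prefix='pclass_sex')
df = pd.concat([df, dummies], axis=1)
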

Ordinal Features
  • An ordered categorical feature
  • First class is the most expensive, second less, third the least expensive
  • Driver's licence types A, B, C, D
  • Level of education (sorted in increasingly complex order)
  • Label encoding maps the categories to numbers (works for tree based models)
  • Non-tree models can't use label-encoded ordinal features effectively
Label Encoding
1. Alphabetically sorted: [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of appearance: [S,C,Q] -> [1,2,3]
 - pandas.factorize

Frequency Encoding (depending on the percentage of occurrences)
[S,C,Q] -> [0.5, 0.3, 0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding helps linear models (if the frequency is correlated with the target value, the linear model will use that dependency). It also preserves information about the value distribution.
  • If several categories have the same frequency, apply a rank transformation to break the ties
  • from scipy.stats import rankdata
Summary
  • Ordinal is a special case of a categorical feature
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for tree based models
  • One-hot encoding is used for non-tree based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!

Day #72- Feature Generation - Numeric Features

Feature Generation
  • Example: predict apple sales that follow a linear trend
  • Add a feature indicating the week number; a GBDT will then use the value calculated for each week
  • The generated feature changes the tree that gets built
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Apply Regularization in equal amounts
  • Do proper scaling
Min Max Scaler
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
Preprocessing (scaling) should be applied to all features, not just a few, so that each feature's initial impact on the model is roughly similar.
Preprocessing Outliers
  • Clip values to lower and upper bounds (e.g. the 1st and 99th percentiles)
  • Rank transformation
  • If outliers are present, ranking can be a better option than MinMax scaling
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation - np.log(1 + x)
  • Raising to a power < 1 - np.sqrt(x + 2/3) (a small sketch of these transforms follows below)
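
A minimal sketch of these scalers and transformations; the toy X matrix is a made-up placeholder.

import numpy as np
from scipy.stats import rankdata
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.exponential(size=(100, 3))      # toy numeric feature matrix

X_minmax = MinMaxScaler().fit_transform(X)    # to [0, 1]
X_std = StandardScaler().fit_transform(X)     # to mean 0, std 1

X_rank = np.apply_along_axis(rankdata, 0, X)  # rank transform, less sensitive to outliers
X_log = np.log1p(X)                           # log(1 + x)
X_pow = np.sqrt(X + 2 / 3)                    # raising to a power < 1
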
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example: add price per square foot if the price and the size of the plot are provided
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!

Day #71 - Kaggle Best Practices


After a long pause, back to learning mode. This post is on learnings from the Coursera course - Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics on Kaggle Competition
  • Data - text, pictures (format could be csv, database, text file, speech etc.), accompanied by a description of the features
  • Model - exactly what is built during the competition; it transforms data into answers. Model properties - it should produce the best possible predictions and be reproducible
  • Submissions - your predictions are compared against those submitted by others
  • Evaluation - how good is your model; the quality of the model is defined by an evaluation function (e.g. the rate of correct answers)
  • Evaluation criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE

Guidelines for submissions
  • Analyze data
  • Fit model
  • Submit
  • See Public Score
Data Science Competition Platforms
  • kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • DataScience.net
  • KDD
  • ViZDoom
Session #2
Using Kaggle
  • Data format and explanations
  • Evaluation Criteria
  • Sample Submission File
  • Timelines page
Kaggle vs Real-World Competitions

Real-world machine learning problems have several stages
  • Understand business problem
  • Problem Formulation
  • Collect Data, Mine Examples
  • Clean Data
  • Preprocess it
  • Identify Model for Task
  • Select best models
  • Accuracy
  • Deploy model (make it available to users)
  • Monitor and retrain with new data

Kaggle
  • All data is already collected and the problem is already formalized
  • You focus on model creation and evaluation
Summary
  • Real world problems are complicated
  • Competitions are a great way to learn
  • But Kaggle competitions don't address the questions of formalization, deployment and testing
Key insights
  • Importance of understanding the data
  • Tools to use
  • It is sometimes worth trying complex solutions, advanced feature engineering and huge calculations
Session #3
  • Linear models (classify two sets of points with a straight line in 2 dimensions) - Logistic Regression, SVM (linear models with different loss functions); good for sparse, high-dimensional data; linear models split the space into two subspaces
  • Tree based - use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees); DT uses a divide-and-conquer approach to recursively split the space into sub-spaces. Tree based methods split the space into boxes
  • KNN - K Nearest Neighbours; points close to each other are likely to have similar labels, so find the K nearest objects and label by majority vote; relies heavily on how the distance between points is measured
  • Neural Networks - a special class of ML models; a black box that produces smooth separating curves; play with the parameters of simple feed-forward networks. Good for images, sound, text, speech
Session #4 - Data Preprocessing
  • Preprocess for feature engineering
  • Basic feature generation for different types of features
  • Numeric, Categorical, DateTime based features
Features
  • (0/1) - Binary Features
  • Numeric features (Age, fare)
  • Categorical (Classes)
Feature Preprocessing
  • Each feature has its own ways to be preprocessed
  • Preprocessing depends on the model you plan to use
  • Linear models do not work well with raw class-label features
  • Use a one-hot encoder for them
  • Random forest can easily treat each class separately and predict each probability

Data Types
Structured Data
  • Ordinal - ranks such as 1st / 2nd / 3rd
  • Numerical - specific numeric data
  • Continuous - e.g. petrol prices
  • Categorical - e.g. days of the week, months of the year

Happy Learning and Coding!!!