"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 27, 2018

Day #100 - Ensemble Methods

It took more than a year to reach 100 posts. This is a significant milestone. Hoping to reach 200 soon.
  • Combining different machine learning models for more powerful predictions
  • Averaging or blending
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • StackNet
Averaging ensemble methods
  • Combine two results with simple averaging
  • (model1+model2)/2
  • Averaging can yield considerable improvements
  • Models may perform better combined than they do individually
  • Weighted average - (model1 * 0.7 + model2 * 0.3)
  • Conditional average - if the prediction is < 50 use model1, else model2 (see the sketch below)
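A minimal sketch of the three averaging schemes, assuming pred1 and pred2 are prediction arrays from two already-trained models (the values are made up):

import numpy as np

# Hypothetical predictions from two already-trained models
pred1 = np.array([42.0, 55.0, 61.0, 38.0])
pred2 = np.array([48.0, 53.0, 70.0, 35.0])

simple_avg = (pred1 + pred2) / 2                  # simple averaging
weighted_avg = pred1 * 0.7 + pred2 * 0.3          # weighted averaging
conditional = np.where(pred1 < 50, pred1, pred2)  # if < 50 trust model1, else model2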
Bagging
  • Averaging slightly different versions of same model to improve accuracy
  • Example - Random Forest
  • Underfitting - errors due to bias
  • Overfitting - errors due to variance
  • Parameters that control bagging - seed, subsampling or bootstrapping, shuffling, column subsampling, model-specific parameters, number of bags (more bags generally give better results), parallelism
  • BaggingClassifier and BaggingRegressor from sklearn (see the sketch below)
  • Bags are independent of each other, so they can be trained in parallel
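A minimal bagging sketch with sklearn's BaggingClassifier; the dataset and parameter values are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data

# Each bag is a decision tree (the default base estimator) trained on a random
# subset of rows and columns; bags are independent, so n_jobs=-1 runs them in parallel.
bagging = BaggingClassifier(
    n_estimators=100,   # number of bags - more bags generally help
    max_samples=0.8,    # row subsampling / bootstrapping
    max_features=0.8,   # column subsampling
    random_state=0,     # seed
    n_jobs=-1,
)
print(cross_val_score(bagging, X, y, cv=5).mean())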
Boosting
  • Weight based boosting
  • A form of weighted averaging of models where each model is built sequentially, taking into account the performance of previous models
  • Each new model is added sequentially based on how well the previous models have done
Weight based boosting
  • Weights can be interpreted as the number of times a certain row appears in the data
  • Rows that contribute to the error get their weights recalculated (increased) each round
  • Parameters - learning rate / shrinkage (trust many models a little rather than one a lot), number of estimators
  • Implementations - AdaBoost (sklearn - Python), LogitBoost (Weka - Java); see the sketch below
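A minimal weight-based boosting sketch with sklearn's AdaBoostClassifier (toy data, illustrative parameter values):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data

# Each round re-weights the rows that previous models got wrong
ada = AdaBoostClassifier(
    n_estimators=200,   # number of estimators
    learning_rate=0.1,  # shrinkage - trust each individual model less
    random_state=0,
)
ada.fit(X, y)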
Residual based boosting
  • The dominant form of boosting used in practice
  • Calculate the error of the predictions and its direction
  • Make the error the new target variable
  • Parameters - Learning Rate, Shrinkage, ETA
  • Number of estimators
  • Row sub sampling
  • Column sub sampling
  • Sub-boosting type - fully gradient based, DART
  • XGBoost
  • LightGBM
  • H2O GBM (handles categorical variables out of the box)
  • CatBoost
  • sklearn's GBM
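A toy sketch of the residual-boosting idea itself (in practice you would reach for XGBoost / LightGBM / CatBoost); the data and parameter values are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1                       # eta / shrinkage
prediction = np.zeros(len(y))
trees = []
for _ in range(100):                      # number of estimators
    residual = y - prediction             # the error becomes the new target
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)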
Stacking
  • Make predictions with a number of models on a hold-out set, then train a different meta-model on those predictions
  • Stacking predictions
  • Splitting training set into two disjoint sets
  • Train several base learners on the first part
  • Make predictions with the base learners on the second (validation) part
  • Use the predictions from the previous step as input features to train a higher-level (meta) learner
  • Train Algo 0 on A (the first part), make predictions for B (the second part) and C (the test set), and save them to B1, C1
  • Train Algo 1 on A, make predictions for B and C, and save them to B1, C1
  • Train Algo 2 on A, make predictions for B and C, and save them to B1, C1
  • Train Algo 3 on B1 and make predictions for C1 (see the sketch below)
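A minimal stacking sketch following the A / B / C recipe above; the data, models and split sizes are only illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression

X, y = make_regression(n_samples=1500, n_features=10, noise=5.0, random_state=0)
# A: training part, B: hold-out part, C: "test" part
X_rest, X_C, y_rest, y_C = train_test_split(X, y, test_size=0.33, random_state=0)
X_A, X_B, y_A, y_B = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

base_models = [RandomForestRegressor(random_state=0), Ridge()]

# Train each base learner on A, predict on B and C -> B1 and C1
B1 = np.column_stack([m.fit(X_A, y_A).predict(X_B) for m in base_models])
C1 = np.column_stack([m.predict(X_C) for m in base_models])

# Train the meta-model on B1, predict on C1
meta = LinearRegression().fit(B1, y_B)
final_test_pred = meta.predict(C1)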

Happy Learning!!!

Day #99 - Statistics and distance based features

Stats
  • Calculate statistics of derived features from neighborhood analysis
  • User_id / Page_id / Ad_price / Ad_position
  • Use label encoder
  • These statistics implicitly bring in information from other data points
  • Add the lowest and highest price for the position of the ad
  • maximum and minimum price values
  • Pages user visited
  • Standard deviation of prices
  • Most visited page
  • Many more features
  • Introduce new information (see the groupby sketch below)
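A minimal groupby-statistics sketch; the column names follow the user_id / page_id / ad_price / ad_position example above and the data is made up:

import pandas as pd

df = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 2, 3],
    "page_id":     [10, 11, 10, 12, 12, 11],
    "ad_price":    [5.0, 7.5, 3.0, 4.0, 9.0, 6.0],
    "ad_position": [1, 2, 1, 3, 2, 1],
})

# Lowest / highest / std of price for each user and ad position
grp = df.groupby(["user_id", "ad_position"])["ad_price"]
df["user_pos_min_price"] = grp.transform("min")
df["user_pos_max_price"] = grp.transform("max")
df["user_pos_std_price"] = grp.transform("std")

# How many distinct pages each user visited
df["user_n_pages"] = df.groupby("user_id")["page_id"].transform("nunique")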
Neighbors
  • Number of houses in 500m, 1000m
  • Average price per sq.m
  • Number of schools / supermarkets / parking lots in 500m / 1000m
  • Distance to closest substation
  • Embrace both group-by and nearest neighbor methods
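A minimal nearest-neighbour feature sketch, assuming hypothetical house coordinates (in metres) and prices per square metre:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
coords = rng.uniform(0, 10_000, size=(1000, 2))      # hypothetical house coordinates (m)
price_per_sqm = rng.uniform(1000, 5000, size=1000)   # hypothetical prices

nn = NearestNeighbors().fit(coords)
idx_500 = nn.radius_neighbors(coords, radius=500, return_distance=False)
idx_1000 = nn.radius_neighbors(coords, radius=1000, return_distance=False)

n_houses_500m = np.array([len(i) - 1 for i in idx_500])   # exclude the house itself
avg_price_1000m = np.array([price_per_sqm[i].mean() for i in idx_1000])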
Matrix Factorizations
  • Approach for feature extraction
  • User / Items mapping matrix
  • User - Attributes matrix
  • R ≈ U × M - the ratings matrix R is approximated by the product of the user matrix U and the item matrix M
  • Row and column related features
  • BOW representations are large sparse vectors
  • Each document is represented by a small dense vector (dimensionality reduction)
  • Matrix factorization methods - SVD, PCA, TruncatedSVD (for sparse matrices)
  • NMF (Non-Negative Matrix Factorization) - Zero or Positive Number
  • NMF makes data suitable for decision trees
  • Used for Dimensionality reduction
Example Code
import numpy as np
from sklearn.decomposition import PCA

# Fit on train and test together, then transform each part
x_all = np.concatenate([x_train, x_test])
pca = PCA(n_components=5).fit(x_all)   # n_components chosen only for illustration
x_train_pca = pca.transform(x_train)
x_test_pca = pca.transform(x_test)

Happy Learning!!!

January 25, 2018

Day #98 - Advanced Hyperparameter tuning

Neural Network Libraries
  • Keras (Easy to learn)
  • TensorFlow (commonly used in production)
  • MXNet
  • PyTorch (Popular in community)
  • sklearn's MLP
Neural Nets
  • Number of neurons per layer
  • Number of layers
  • Optimizers
  • SGD + momentum
  • Adam / Adadelta / Adagrad (In practice lead to more overfitting)
  • Batch size (Huge batch size leads to overfitting)
  • Epochs impact
  • Learning rate - not too high, not too low; a rate at which the network converges
  • Regularization
    • L2/L1 for weights
    • Dropout / Dropconnect
    • Static dropconnect
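A minimal Keras sketch touching the knobs above (layer sizes, dropout, L2 on weights, SGD + momentum, learning rate, batch size); all values are illustrative and x_train / y_train are assumed to exist:

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 on weights
    layers.Dropout(0.5),                                     # dropout
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(1),
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGD + momentum
    loss="mse",
)
# model.fit(x_train, y_train, batch_size=64, epochs=30, validation_split=0.2)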
Linear Models (Scikit-learn)
  • SVC / SVR
  • sklearn wraps liblinear and libsvm
  • Compile yourself for multicore support
  • LogisticRegression / LinearRegression + regularizers
  • SGDClassifier / SGDRegressor
  • Vowpal Wabbit
  • Regularization parameter (C, alpha, lambda)
  • Start with very small value and increase it
  • SVC starts to work slower as C increases
  • Regularization type
    • L1/L2/L1+L2 - try each
    • L1 can be used for feature selection
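A minimal sketch of sweeping the regularization parameter C for LogisticRegression, starting with a very small value and increasing it (toy data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)  # toy data

# Start with a very small C (strong regularization) and increase it
for C in [0.001, 0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, penalty="l2", solver="liblinear")
    print(C, cross_val_score(clf, X, y, cv=5).mean())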
Happy Learning!!!

January 24, 2018

Day #97 - Hyperparameter tuning

How to tune hyperparameters?
  • Identify which parameters affect the model most
  • Observe the impact of changing a parameter's value
  • Examine and iterate to understand the impact of the changes
Automatic Hyper-parameter tuning libraries
  • Hyperopt
  • Scikit-optimize
  • Spearmint
  • GPyOpt
  • RoBO
  • SMAC3
Hyper parameter tuning
  • Tree Based Models (Gradient Boosted Decision Trees - XGBoost, LightGBM, CatBoost)
  • RandomForest / ExtraTrees
Neural Nets
  • Pytorch, Tensorflow, Keras
Linear Models
  • SVM, Logistic Regression
  • Vowpal Wabbit, FTRL
Approach
  • Define function that will run our model
  • Specify ranges of hyperparameters
  • Choose an adequate range for the search (see the Hyperopt sketch below)
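A minimal Hyperopt sketch of this approach - an objective function that runs the model plus a search space with adequate ranges; the model, data and ranges are only illustrative:

from hyperopt import fmin, tpe, hp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data

# Function that runs the model for a given set of hyperparameters
def objective(params):
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    return -cross_val_score(model, X, y, cv=3).mean()  # hyperopt minimizes

# Search space with adequate ranges
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth":    hp.quniform("max_depth", 2, 12, 1),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)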
Results
  • Underfitting
  • Overfitting
  • Good Fit and Generalization
Tree based Models
  • GBDT - XGBoost, LightGBM, CatBoost
  • RandomForest, ExtraTrees - Scikit-learn
  • Others - RGF (baidu / fast_rgf)
GBDT
  • XGBoost - max_depth, subsample, colsample_bytree, colsample_bylevel, min_child_weight, lambda, alpha, eta, num_round, seed (see the sketch after this list)
  • LightGBM - max_depth / num_leaves, bagging_fraction, feature_fraction, min_data_in_leaf, lambda_l1, lambda_l2, learning_rate, num_iterations, seed
  • sklearn RandomForest/ExtraTrees - n_estimators, max_depth, max_features, min_samples_leaf, n_jobs, random_state
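A minimal XGBoost sketch using the parameter names above; the values are only illustrative starting points, not recommendations:

import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)  # toy data
dtrain = xgb.DMatrix(X, label=y)

params = {
    "max_depth": 6,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 1.0,
    "min_child_weight": 1,
    "lambda": 1.0,
    "alpha": 0.0,
    "eta": 0.05,
    "seed": 0,
}
booster = xgb.train(params, dtrain, num_boost_round=500)  # num_round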

Happy Learning!!!

January 22, 2018

Day #96 - Mean Encoding - Extensions and Generalizations

  • Compact transformation of categorical variables
  • Powerful basis of feature engineering
Using the target variable in different tasks - regression, multi-class
  • More stats - Percentiles, std, distribution bins
  • Introducing new information from one vs all classifiers in multi-class tasks (N Different encodings)
Domains with many-to-many relationships
  • User to Apps relationships
  • Row for user-app relationship
  • Vector for each app
Time-series
  • Mean of the target for the previous day, previous week, etc.
  • Based on data create more complicated features
Encoding interactions and numerical features
  • model structure, analyzing trees
  • Extract interactions from decision trees (features that appear in neighbouring nodes interact)
  • xgboost, row features
  • Use split points to identify new features
  • Manually add more mean encoded interactions
  • For interactions involving categorical variables, evaluate them and mean-encode the interactions
Correct validation reminder
Local experiments
  • Estimate encodings on X_tr
  • Map them to X_tr and X_val
  • Regularize on X_tr
  • Validate the model on the X_tr / X_val split
Submission
  • Estimate Encoding on whole Train data
  • Map them to Train and Test
  • Regularize on Train
  • Fit on Train
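A minimal mean-encoding sketch following the local-experiment recipe above (estimate the encoding on X_tr, map it to X_tr and X_val); the data is made up and regularization (smoothing, CV loop) is omitted:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c", "c", "a", "b", "c"],
    "target":   [1, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})
X_tr, X_val = train_test_split(df, test_size=0.3, random_state=0)
X_tr, X_val = X_tr.copy(), X_val.copy()

# Estimate the encoding on X_tr only, then map it to X_tr and X_val
global_mean = X_tr["target"].mean()
encoding = X_tr.groupby("category")["target"].mean()
X_tr["category_mean_enc"] = X_tr["category"].map(encoding)
X_val["category_mean_enc"] = X_val["category"].map(encoding).fillna(global_mean)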
Happy Learning!!!