"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 22, 2018

Day #96 - Mean Encoding - Extensions and Generalizations

  • Compact transformation of categorical variables
  • Powerful basis of feature engineering
Using the target variable in different tasks: regression, multi-class
  • More stats - Percentiles, std, distribution bins
  • Multi-class: N different encodings (one per class) give one-vs-all classifiers new information about the other classes
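A minimal sketch of both ideas with pandas. The column names (city, price, label) and the toy data are made up for illustration. The encodings are unregularized, so this only shows the mechanics.

```python
import pandas as pd

# Toy regression data: one categorical feature and a numeric target.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "c"],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0, 7.0],
})

# Several statistics of the target per category, not just the mean.
# std is NaN for categories with a single row (city "c" here).
stats = df.groupby("city")["price"].agg(["mean", "median", "std", "max"])
stats.columns = [f"city_price_{c}" for c in stats.columns]
df = df.join(stats, on="city")

# Toy multi-class data: one encoding per class (N classes -> N columns).
mc = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "label": ["x", "y", "x", "x", "z", "z"],
})
onehot = pd.get_dummies(mc["label"], prefix="is").astype(float)
enc = onehot.groupby(mc["city"]).mean()  # share of each class within each city
enc.columns = [f"city_enc_{c}" for c in enc.columns]
mc = mc.join(enc, on="city")
```

Each row in mc now carries three numbers for its city, one per class, instead of a single mean.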
Domains with many-to-many relationships
  • User to Apps relationships
  • Row for user-app relationship
  • Vector for each app
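One way to read the user-app notes above, as a rough sketch: build one row per user-app pair, mean-encode the apps, then collapse each user's vector of app encodings into a few statistics. The table and column names are hypothetical, and no regularization is applied here.

```python
import pandas as pd

# One row per user with the target (hypothetical toy data).
users = pd.DataFrame({"user": [1, 2, 3, 4], "target": [1, 0, 1, 0]})

# Long format: one row per user-app pair.
pairs = pd.DataFrame({
    "user": [1, 1, 2, 2, 3, 4],
    "app": ["maps", "chat", "chat", "game", "maps", "game"],
})

# Attach the user's target to every pair, then mean-encode each app.
pairs = pairs.merge(users, on="user")
app_mean = pairs.groupby("app")["target"].mean()
pairs["app_enc"] = pairs["app"].map(app_mean)

# Each user now has a variable-length vector of app encodings;
# collapse it into fixed-size statistics.
user_feats = pairs.groupby("user")["app_enc"].agg(["mean", "min", "max"])
user_feats.columns = [f"app_enc_{c}" for c in user_feats.columns]
users = users.join(user_feats, on="user")
```

The choice of mean, min and max is arbitrary; any statistic of the vector could be used.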
Time-series
  • Mean of the target over the previous day, previous week, etc.
  • Based on data create more complicated features
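A small sketch of lagged target means, assuming one or more rows per (day, cat) and consecutive days with no gaps. The column names and the 7-day window are illustrative choices, not something fixed by the notes.

```python
import pandas as pd

# Hypothetical data: several rows per day and category.
df = pd.DataFrame({
    "day": [1, 1, 1, 2, 2, 3, 3],
    "cat": ["a", "a", "b", "a", "b", "a", "b"],
    "target": [1, 0, 1, 1, 0, 0, 1],
})

# Daily mean of the target per category (sorted by cat, then day).
daily = df.groupby(["cat", "day"], as_index=False)["target"].mean()

# Previous day's mean: shift by one day within each category.
daily["prev_day_mean"] = daily.groupby("cat")["target"].shift(1)

# Mean over up to 7 previous days, excluding the current day to avoid leakage.
daily["prev_week_mean"] = daily.groupby("cat")["target"].transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean()
)

# Map the lag features back to the original rows.
df = df.merge(daily.drop(columns="target"), on=["cat", "day"], how="left")
```

The first day of every category has no history, so its lag features stay NaN and need a fill value or a model that handles missing values.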
Encoding interactions and numerical features
  • Find interactions from the model structure by analyzing trees
  • Extract interactions from decision trees (two features interact if they appear in neighboring nodes)
  • Fit xgboost on raw features
  • Use split points to identify new features
  • Manually add more mean encoded interactions
  • Evaluate interactions that involve categorical variables
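The manual step can be sketched as follows: concatenate two categorical features into a new one and mean-encode it. The features f1 and f2 are placeholders for a pair found, for example, by inspecting tree splits. The toy data is built so that neither feature is informative alone.

```python
import pandas as pd

df = pd.DataFrame({
    "f1": ["a", "a", "b", "b", "a", "b"],
    "f2": ["x", "y", "x", "y", "x", "y"],
    "target": [1, 0, 0, 1, 1, 1],
})

# Build the interaction as a new categorical feature.
df["f1_f2"] = df["f1"] + "_" + df["f2"]

# Mean-encode the interaction (unregularized, for illustration only).
df["f1_f2_enc"] = df.groupby("f1_f2")["target"].transform("mean")

# Single-feature encodings for comparison: both are constant on this data.
df["f1_enc"] = df.groupby("f1")["target"].transform("mean")
df["f2_enc"] = df.groupby("f2")["target"].transform("mean")
```

On this data f1_enc and f2_enc are the same for every row, while f1_f2_enc separates the target, which is the kind of signal an interaction encoding is meant to expose.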
Correct validation reminder
Local experiments
  • Estimate encodings on X_tr
  • Map them to X_tr and X_val
  • Regularize on X_tr
  • Validate model on X_tr / X_val split
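A sketch of the local-experiment steps, assuming out-of-fold encoding as the regularizer (other schemes such as smoothing or an expanding mean would fit the same slot). The helper oof_mean_encode, the column names and the toy data are made up for this example.

```python
import numpy as np
import pandas as pd

def oof_mean_encode(tr, col, target, n_splits=2, seed=0):
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    prior = tr[target].mean()
    enc = pd.Series(np.nan, index=tr.index)
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(tr)), n_splits)
    for fold in folds:
        mask = np.zeros(len(tr), dtype=bool)
        mask[fold] = True
        means = tr.loc[~mask].groupby(col)[target].mean()
        enc.iloc[fold] = tr[col].iloc[fold].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean.
    return enc.fillna(prior)

data = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b", "a", "b", "c"],
    "y": [1, 1, 0, 0, 1, 0, 1, 0, 1],
})
X_tr, X_val = data.iloc[:6].copy(), data.iloc[6:].copy()

# Estimate encodings on X_tr only and map them to X_val.
tr_means = X_tr.groupby("cat")["y"].mean()
prior = X_tr["y"].mean()
X_val["cat_enc"] = X_val["cat"].map(tr_means).fillna(prior)

# Training rows get the regularized (out-of-fold) version.
X_tr["cat_enc"] = oof_mean_encode(X_tr, "cat", "y")
```

The validation targets never enter the encodings, so the score on X_val is not inflated by leakage.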
Submission
  • Estimate Encoding on whole Train data
  • Map them to Train and Test
  • Regularize on Train
  • Fit on Train
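The submission steps follow the same pattern with Train and Test in place of X_tr and X_val. In this sketch the regularizer on Train is an expanding mean, picked only as an example. The column names and data are hypothetical, and the final model fit is left out.

```python
import pandas as pd

train = pd.DataFrame({
    "cat": ["a", "b", "a", "b", "a", "b"],
    "y": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"cat": ["a", "b", "c"]})

prior = train["y"].mean()

# Test rows: encodings estimated on the whole train set.
full_means = train.groupby("cat")["y"].mean()
test["cat_enc"] = test["cat"].map(full_means).fillna(prior)

# Train rows: expanding mean, so each row only sees the targets
# of earlier rows in the same category (never its own target).
cumsum = train.groupby("cat")["y"].cumsum() - train["y"]
cumcnt = train.groupby("cat").cumcount()
train["cat_enc"] = (cumsum / cumcnt).fillna(prior)  # first occurrence -> prior
```

After this, the model is fit on train with cat_enc as a feature and applied to test.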
Happy Learning!!!
