"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 28, 2017

Day #92 - Mean Encoding

Mean Encoding
  • Adds new variables derived from categorical features
  • Label encoding is the usual baseline for categorical variables
  • Mean encoding replaces each category with the mean of the target within that category (the proportion of positives per label)
  • Mean encoding can be used alongside label encoding
  • Label encoding - codes have no logical order
  • Mean encoding - categories get ordered by target rate, so classes become separable (see the sketch after this list)
  • We can reach a better loss with shorter trees
  • With plain label encoding, trees need a huge number of splits
  • The model tries to treat each category differently
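
A minimal sketch of the label-vs-mean-encoding contrast. The toy column (city), the data, and all names here are illustrative assumptions, not from the original notebook:

import pandas as pd

# Hypothetical toy data: one categorical column, one binary target
df = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'target': [ 1,   1,   0,   1,   0,   0,   0,   1 ]})

# Label encoding: arbitrary integer codes, no relation to the target
df['city_label'] = df['city'].astype('category').cat.codes

# Mean encoding: each category becomes its mean target rate,
# so categories end up ordered by how "positive" they are
df['city_mean'] = df['city'].map(df.groupby('city')['target'].mean())
print(df)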
Constructing Mean Encoding
  • Goods - Number of ones in a group
  • Bads - Number of zeros
Likelihood = Goods/(Goods + Bads) = mean(target)
Weight of Evidence = ln(Goods/Bads) * 100
Count = Goods = sum(target)
Diff = Goods-Bads
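
A short sketch computing all four constructions per group, again on hypothetical toy data; the epsilon smoothing in the Weight of Evidence term is an added assumption to avoid log(0) on groups that are all ones or all zeros:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'target': [ 1,   1,   0,   1,   0,   0,   0,   1 ]})

grp = df.groupby('city')['target']
goods = grp.sum()            # number of ones in each group
bads = grp.count() - goods   # number of zeros in each group

eps = 0.5  # assumed smoothing so pure groups do not hit log(0)
encodings = pd.DataFrame({
    'likelihood': goods / (goods + bads),              # = mean(target)
    'woe':   np.log((goods + eps) / (bads + eps)) * 100,
    'count': goods,                                    # = sum(target)
    'diff':  goods - bads})
print(encodings)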

# Mean of the target per category, learned on the training split only
means = x_tr.groupby(col).target.mean()
# Map the learned means onto both splits; categories unseen in
# training become NaN and can be filled with the global mean
train_new[col + '_mean_target'] = train_new[col].map(means)
val_new[col + '_mean_target'] = val_new[col].map(means)
means  # inspect the learned encoding
# Fit an XGBoost model on the mean-encoded data
import xgboost as xgb

dtrain = xgb.DMatrix(train_new, label=y_tr)
dvalid = xgb.DMatrix(val_new, label=y_val)
evalist = [(dtrain, 'train'), (dvalid, 'eval')]
evals_result3 = {}
# Train for up to 3000 rounds, logging every 30 and stopping early
# if the eval metric does not improve for 50 consecutive rounds
model = xgb.train(xgb_par, dtrain, 3000, evals=evalist, verbose_eval=30,
                  evals_result=evals_result3, early_stopping_rounds=50)

Happy Learning!!!
