"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 27, 2018

Day #100 - Ensemble Methods

It took more than a year to reach 100 posts. This is a significant milestone. Hoping to reach 200 soon.
  • Combining different machine learning models for more powerful prediction
  • Averaging or blending
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • Stacknet
Averaging ensemble methods
  • Combine two results with simple averaging
  • (model1 + model2) / 2
  • Considerable improvements can be achieved with averaging
  • Models often perform better combined than individually
  • Weighted average - (model1 * 0.7 + model2 * 0.3)
  • Conditional average - e.g., if the prediction is < 50 use model1, else model2 (see the sketch after this list)
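A minimal sketch of the three schemes, using made-up prediction arrays purely for illustration:

import numpy as np
#toy predictions from two hypothetical models (values invented for illustration)
model1 = np.array([42.0, 55.0, 61.0])
model2 = np.array([48.0, 51.0, 58.0])
#simple averaging
simple_avg = (model1 + model2) / 2
#weighted averaging - trust model1 more
weighted_avg = model1 * 0.7 + model2 * 0.3
#conditional averaging - where model1 predicts below 50 use model1, else model2
conditional_avg = np.where(model1 < 50, model1, model2)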
Bagging
  • Averaging slightly different versions of the same model to improve accuracy
  • Example - Random Forest
  • Underfitting - error due to bias
  • Overfitting - error due to variance
  • Parameters that control bagging - seed, subsampling (bootstrapping), shuffling, column subsampling, model-specific parameters, number of bags (more bags generally give better results), parallelism
  • BaggingClassifier and BaggingRegressor from sklearn (see the sketch after this list)
  • Bags are independent of each other, so they can be built in parallel
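A minimal sketch of sklearn's BaggingRegressor, assuming train, y, and test are the training features, target, and test features used in the code at the end of this post:

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
#bag 10 decision trees, each trained on 80% of the rows and columns
#(train, y, test are assumed to be defined elsewhere)
bag_model = BaggingRegressor(DecisionTreeRegressor(),
                             n_estimators=10,   #number of bags
                             max_samples=0.8,   #row subsampling (bootstrapping)
                             max_features=0.8,  #column subsampling
                             n_jobs=-1,         #bags are independent, so build in parallel
                             random_state=1)    #seed
bag_model.fit(train, y)
bag_preds = bag_model.predict(test)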
Boosting
  • Weight based boosting
  • A form of weighted averaging of models where each model is built sequentially, taking into account how well the previous models performed
Weight based boosting
  • Each row gets a weight - conceptually, the number of times that row appears in the data
  • After each model, weights are recalculated based on each row's contribution to the error
  • Parameters - learning rate (shrinkage; a small rate means trusting many models), number of estimators
  • Implementations - AdaBoost (sklearn - python), LogitBoost (Weka - java); a sketch follows this list
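A minimal sketch of weight based boosting with sklearn's AdaBoostRegressor, again assuming train, y, and test are defined as in the code at the end of this post:

from sklearn.ensemble import AdaBoostRegressor
#100 sequential estimators; a small learning rate means trusting many models
#(train, y, test are assumed to be defined elsewhere)
ada_model = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=1)
ada_model.fit(train, y)
ada_preds = ada_model.predict(test)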
Residual based boosting
  • The dominant form of boosting in practice
  • Calculate the error of the predictions and the direction of the error
  • Make the error the new target variable
  • Parameters - learning rate (also called shrinkage or eta)
  • Number of estimators
  • Row subsampling
  • Column subsampling
  • Sub-boosting type - fully gradient based, DART
  • XGBoost
  • LightGBM
  • H2O GBM (handles categorical variables out of the box)
  • CatBoost
  • Sklearn's GBM (see the sketch after this list)
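A minimal sketch of residual based boosting with sklearn's GBM, assuming the same train, y, and test as in the code at the end of this post:

from sklearn.ensemble import GradientBoostingRegressor
#(train, y, test are assumed to be defined elsewhere)
gbm_model = GradientBoostingRegressor(n_estimators=100,   #number of estimators
                                      learning_rate=0.1,  #learning rate (shrinkage / eta)
                                      subsample=0.8,      #row subsampling
                                      max_features=0.8,   #column subsampling
                                      random_state=1)
gbm_model.fit(train, y)
gbm_preds = gbm_model.predict(test)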
Stacking
  • Make predictions with a number of models on a hold-out set, then train a different meta-model on those predictions
  • Stacking predictions:
  • Split the training set into two disjoint sets
  • Train several base learners on the first part
  • Make predictions with the base learners on the second (validation) part
  • Use the predictions from the previous step as input to train a higher-level learner
  • Train Algo 0 on A, make predictions for B and C, save to B1, C1 (A = first part of training data, B = validation part, C = test set)
  • Train Algo 1 on A, make predictions for B and C, save to B1, C1
  • Train Algo 2 on A, make predictions for B and C, save to B1, C1
  • Train Algo 3 (the meta-model) on B1 and make predictions for C1

#Bagging Example
import numpy as np
from sklearn.ensemble import RandomForestRegressor
#train is the training data
#test is the test data
#y is the target variable
model = RandomForestRegressor()
bags = 10
seed = 1
#create an array to hold the bagged predictions
bagged_prediction = np.zeros(test.shape[0])
#loop for as many times as we want bags
for n in range(0, bags):
    model.set_params(random_state=seed + n)  #update the seed for each bag
    model.fit(train, y)                      #fit the model
    preds = model.predict(test)              #predict on the test data
    bagged_prediction += preds               #accumulate the predictions
#take the average over all bags
bagged_prediction /= bags
#Stacking Example
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#train is the training data, test is the test data, y is the target variable
#split the training data into two disjoint halves
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)
#train two base learners on the first half
model1 = RandomForestRegressor()
model2 = LinearRegression()
model1.fit(training, ytraining)
model2.fit(training, ytraining)
#predict on the second (validation) half and on the test set
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
test_pred1 = model1.predict(test)
test_pred2 = model2.predict(test)
#stack the base-learner predictions as features for the meta-model
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_pred1, test_pred2))
#train the meta-model on the validation predictions and predict on test
meta_model = LinearRegression()
meta_model.fit(stacked_predictions, yvalid)
final_predictions = meta_model.predict(stacked_test_predictions)
Happy Learning!!!
