"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 27, 2018

Day #100 - Ensemble Methods

It took more than a year to reach 100 posts. This is a significant milestone. Hoping to reach 200 soon.
  • Combining different machine learning models for more powerful predictions
  • Averaging or blending
  • Weighted averaging
  • Conditional averaging
  • Bagging
  • Boosting
  • Stacking
  • StackNet
Averaging ensemble methods
  • Combine two results with simple averaging
  • (model1+model2)/2
  • Considerable improvements can be achieved with simple averaging
  • Models can perform better combined than they do individually
  • Weighted average - (model1*0.7 + model2*0.3)
  • Conditional average - (if the prediction is < 50 use model1, else model2) - see the sketch below
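
A minimal sketch of these three averaging schemes, assuming model1 and model2 are already-fitted regressors and X is a feature matrix (the names, the 0.7/0.3 weights, and the threshold of 50 are just illustrative):

import numpy as np

def ensemble_predictions(model1, model2, X, threshold=50):
    p1 = model1.predict(X)                               # predictions of the first model
    p2 = model2.predict(X)                               # predictions of the second model
    simple_avg = (p1 + p2) / 2                           # plain average
    weighted_avg = 0.7 * p1 + 0.3 * p2                   # weighted average
    conditional_avg = np.where(p1 < threshold, p1, p2)   # if model1's prediction is < 50 use it, else use model2's
    return simple_avg, weighted_avg, conditional_avg
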
Bagging
  • Averaging slightly different versions of the same model to improve accuracy
  • Example - Random Forest
  • Underfitting - error due to bias
  • Overfitting - error due to variance
  • Parameters that control bagging - seed, subsampling (bootstrapping), shuffling, column subsampling, model-specific parameters, number of bags (more bags generally give better results), parallelism
  • BaggingClassifier and BaggingRegressor from sklearn
  • Bags are independent of each other, so they can be trained in parallel - a small example follows below
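
A small sketch with sklearn's BaggingClassifier on synthetic data; the parameter values are illustrative, and the default base estimator is a decision tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

bag = BaggingClassifier(
    n_estimators=50,      # number of bags; more bags usually give better results
    max_samples=0.8,      # row subsampling / bootstrapping
    max_features=0.8,     # column subsampling
    random_state=42,      # seed
    n_jobs=-1)            # bags are independent, so they can be built in parallel

print(cross_val_score(bag, X, y, cv=5).mean())
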
Boosting
  • Weight based boosting
  • A form of weighted averaging of models where each model is built sequentially, taking into account the performance of the previous models
  • Models are added sequentially so that each one corrects for how well the previous models have done
Weight based boosting
  • Weights can be represented as the number of times a certain row appears in the data
  • After each model, recalculate the weights based on each row's contribution to the error
  • Parameters - learning rate (shrinkage; a small value means trusting many models), number of estimators
  • Implementations - AdaBoost (sklearn, Python), LogitBoost (Weka, Java) - see the sketch below
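
A minimal AdaBoost sketch with sklearn on synthetic data (parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)

ada = AdaBoostClassifier(
    n_estimators=200,    # number of sequentially built models
    learning_rate=0.1,   # shrinkage; a small value means trusting many models
    random_state=42)

print(cross_val_score(ada, X, y, cv=5).mean())
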
Residual based boosting
  • The dominant form of boosting in practice
  • Calculate the error (residual) of the predictions and its direction
  • Make the error the new target variable for the next model
  • Parameters - learning rate (also called shrinkage or eta)
  • Number of estimators
  • Row subsampling
  • Column subsampling
  • Sub-boosting type - fully gradient based, DART
  • XGBoost
  • LightGBM
  • H2O GBM (handles categorical variables out of the box)
  • CatBoost
  • sklearn's GBM (a toy residual-fitting sketch follows below)
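
A toy sketch of the residual idea using plain decision trees, just to show the "error becomes the new target" loop; real libraries such as XGBoost and LightGBM add many refinements on top of this:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=42)

eta = 0.1                                  # learning rate / shrinkage / eta
pred = np.zeros(len(y))
trees = []
for _ in range(100):                       # number of estimators
    residual = y - pred                    # the error of the current predictions is the new target
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    pred += eta * tree.predict(X)          # move the predictions in the direction of the error

print(np.mean((y - pred) ** 2))            # training MSE shrinks as trees are added
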
Stacking
  • Make predictions with a number of models on a hold-out set, then train a different meta-model on those predictions
  • Steps for stacking predictions
  • Split the training set into two disjoint sets
  • Train several base learners on the first part
  • Make predictions with the base learners on the second (validation) part
  • Use the predictions from step (3) as the input to train a higher-level learner
  • Example - A and B are the two disjoint parts of the training set, C is the test set
  • Train Algo 0 on A, make predictions for B and C, and save them to B1 and C1
  • Train Algo 1 on A, make predictions for B and C, and save them to B1 and C1
  • Train Algo 2 on A, make predictions for B and C, and save them to B1 and C1
  • Train Algorithm 3 (the meta-model) on B1 and make predictions for C1 - see the sketch below
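
A sketch of the A/B/C recipe above on synthetic data; the two base models and the logistic-regression meta-model are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)   # hold out C (the test set)
A_X, B_X, A_y, B_y = train_test_split(X_train, y_train, random_state=42)     # split the training set into A and B

base_models = [RandomForestClassifier(random_state=42),
               GradientBoostingClassifier(random_state=42)]

B1, C1 = [], []
for model in base_models:                         # Algo 0, Algo 1, ... trained on A
    model.fit(A_X, A_y)
    B1.append(model.predict_proba(B_X)[:, 1])     # predictions on B saved to B1
    C1.append(model.predict_proba(X_test)[:, 1])  # predictions on C saved to C1
B1, C1 = np.column_stack(B1), np.column_stack(C1)

meta = LogisticRegression().fit(B1, B_y)          # higher-level learner trained on B1
print(accuracy_score(y_test, meta.predict(C1)))   # its predictions for C1, scored against the test labels
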

Happy Learning!!!
