"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #72- Feature Generation - Numeric Features

Feature Generation
  • Predict Apple Sales (Linear Trend)
  • Example - Add a feature indicating the week number; GBDT will then consider the minimum calculated value for each week
  • Created Generated Tree
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Apply Regularization in equal amounts
  • Do proper scaling
Min Max Scaler
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
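A minimal sketch of the formula above, showing that `sklearn.preprocessing.MinMaxScaler` matches the manual computation (the toy values are illustrative, not from the post):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column
X = np.array([[1.0], [5.0], [10.0]])

# sklearn scaler maps the column into [0, 1]
scaled = MinMaxScaler().fit_transform(X)

# Equivalent manual formula from the notes
manual = (X - X.min()) / (X.max() - X.min())

print(scaled.ravel())  # smallest value becomes 0, largest becomes 1
```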
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
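Similarly, `StandardScaler` reproduces the manual mean/std formula (it uses the population standard deviation, i.e. NumPy's default `ddof=0`; toy values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0]])

# sklearn scaler: result has mean 0 and std 1
scaled = StandardScaler().fit_transform(X)

# Equivalent manual formula from the notes
manual = (X - X.mean()) / X.std()

print(scaled.ravel())
```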
Preprocessing (scaling) should be applied to all features, not just a few, so that each feature's initial impact on the model is roughly similar.
Preprocessing Outliers
  • Calculate lower and upper bound values
  • Rank transformation
  • Rank transformation is often a better option than Min-Max scaling when outliers are present
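A common way to apply the lower/upper-bound idea is clipping (winsorization) at, say, the 1st and 99th percentiles; the percentile choice here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
x[0] = 100.0  # inject an outlier

# Calculate lower and upper bound values, then clip
lower, upper = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lower, upper)

# The extreme value is now capped at the upper bound
print(x_clipped.max())
```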
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation  - np.log(1+x)
  • Raising to power < 1 - np.sqrt(x+2/3)
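The three transformations above can be sketched as follows (toy values chosen to show how an outlier is tamed):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([1.0, 4.0, 4.0, 100.0])  # note the outlier 100

ranks = rankdata(x)         # ties share an average rank
logged = np.log1p(x)        # same as np.log(1 + x)
rooted = np.sqrt(x + 2/3)   # raising to a power < 1 compresses large values

print(ranks)   # the outlier's rank is just one step above its neighbours
```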
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example: add price per square foot when both the price and the size of the plot are provided
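The price-per-square-foot example can be sketched in pandas; the column names and values here are hypothetical:

```python
import pandas as pd

# Toy housing data (hypothetical columns)
df = pd.DataFrame({"price": [300000, 450000],
                   "sqft":  [1500, 2000]})

# New feature engineered from prior knowledge of the domain
df["price_per_sqft"] = df["price"] / df["sqft"]

print(df["price_per_sqft"].tolist())  # [200.0, 225.0]
```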
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!
