"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 31, 2017

Day #82 - Validation and Overfitting


  • Train data (past) vs. unseen test data (future)
  • Divide the data into three parts - Train (past), Validation (past), Test (future) - as shown in the sketch after this list
  • Underfitting - high error on both the training and validation data
  • Overfitting - low error on train, high error on validation; the model does not generalize to unseen data
  • Ideal - lowest error on both the training and validation/test data
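A minimal sketch of the three-way split, assuming scikit-learn's train_test_split and a toy dataset from make_classification (the sizes and random_state are illustrative, not part of the notes above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the "past" data we actually have.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First carve out the "future": a test set we never touch while tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remaining "past" data into train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```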
Validation Strategies
  • Hold Out - divide the data into training/validation sets with no overlap between them; a single split - used on shuffled data (see the sketch after this list)
  • K-Fold - repeated hold-out: split the data into K subsets, train on K-1 and validate on the remaining one, rotating over all folds - a good choice for a medium amount of data - used on shuffled data
  • Leave One Out - special case of K-Fold with K = number of samples (n_splits = len(train)) - useful when there is too little data
  • Stratification - similar target distribution over the different folds
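A short sketch of these strategies using scikit-learn's splitters, assuming a small toy array (the fold count, test size, and seeds are only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.array([0, 1] * 5)

# Hold Out: one shuffled split, no overlap between train and validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

# K-Fold: repeated hold-out; every sample lands in a validation fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, val_idx) in enumerate(kf.split(X)):
    print(f"K-Fold fold {fold}: train={tr_idx}, val={val_idx}")

# Leave One Out: special case of K-Fold with K = number of samples.
loo = LeaveOneOut()
print("LOO splits:", loo.get_n_splits(X))  # == len(X)
```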
Stratification is useful for
  • Small datasets (where random splits may not preserve the target distribution)
  • Unbalanced datasets
  • Multiclass classification
Stratification preserves the target distribution over the different folds (a stratified split sketch follows)
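A minimal sketch of stratified folds on an imbalanced toy dataset, assuming scikit-learn's StratifiedKFold (the 90/10 class weights and dataset size are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=200, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps roughly the same 90/10 class ratio.
    counts = np.bincount(y[val_idx])
    print(f"Fold {fold}: class counts in validation = {counts}")
```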

Happy Coding and Learning!!!
