Dataset cleaning
- Constant features (Remove features whose value is the same in both the training and test data. If a value is constant in training but changes in testing, it is also better to remove the feature, since the model cannot learn anything from it. Constant features can appear when only a fraction of the data is supplied.)
- Duplicated features (Completely identical columns add no information and slow down training; keep one copy and remove the rest)
- Duplicated categorical features (Columns can be identical except for the names of their levels; label-encode the categorical features and compare the encodings)
- Duplicated rows (Duplicated rows with different targets could be the result of a mistake; removing them can improve the score on the test set)
- Check for common rows in train and test sets (For test rows that also appear in the training set, the labels can be set manually)
- Check if dataset is shuffled (Plot a feature or the target against the row index; if the data is shuffled, the values oscillate around the mean with no trend)
- Get domain knowledge
- Check how the data was generated
- Explore individual features
- Explore feature pairs and groups
- Clean features
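
The column and row checks from the list above can be sketched with pandas. This is only a minimal illustration, not a reference implementation: the helper names, the toy `train`/`test` frames, and the target column `y` are all made up for the example. It assumes the test set has the same feature columns as the training set and no target.

```python
import pandas as pd


def constant_columns(train, test, features):
    """Return (constant in train+test combined, constant in train alone)."""
    both = pd.concat([train[features], test[features]], ignore_index=True)
    const_both = [c for c in features if both[c].nunique(dropna=False) <= 1]
    const_train = [c for c in features if train[c].nunique(dropna=False) <= 1]
    return const_both, const_train


def duplicate_columns(df, relabel=False):
    """Map each redundant column to the earlier column it duplicates.

    With relabel=True every column is first encoded by order of first
    appearance (pd.factorize), so columns that differ only in the names
    of their levels are also reported.
    """
    cols = {
        c: pd.Series(pd.factorize(df[c])[0]) if relabel else df[c]
        for c in df.columns
    }
    names = list(df.columns)
    dupes = {}
    for i, a in enumerate(names):
        if a in dupes:
            continue
        for b in names[i + 1:]:
            if b not in dupes and cols[a].equals(cols[b]):
                dupes[b] = a
    return dupes


def conflicting_duplicates(train, features, target):
    """Rows whose feature values repeat with more than one distinct target."""
    n_targets = train.groupby(features, dropna=False)[target].transform("nunique")
    return train[n_targets > 1]


def rows_shared_with_train(train, test, features, target):
    """Test rows that also occur in train, with the train label attached."""
    # drop_duplicates keeps the first label seen for each feature combination
    labelled = train.drop_duplicates(subset=features)[features + [target]]
    return test.merge(labelled, on=features, how="inner")


# Toy data, invented for this example
train = pd.DataFrame({
    "f_const": [1, 1, 1, 1],
    "f_a": [10, 20, 10, 30],
    "f_a_copy": [10, 20, 10, 30],
    "color": ["r", "g", "r", "b"],
    "colour_code": ["x", "y", "x", "z"],
    "y": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "f_const": [1, 1],
    "f_a": [20, 40],
    "f_a_copy": [20, 40],
    "color": ["g", "b"],
    "colour_code": ["y", "z"],
})
features = list(test.columns)

const_both, const_train = constant_columns(train, test, features)
exact_dupes = duplicate_columns(train[features])
relabelled_dupes = duplicate_columns(train[["color", "colour_code"]], relabel=True)
conflicts = conflicting_duplicates(train, features, "y")
overlap = rows_shared_with_train(train, test, features, "y")
```

The relabelled comparison is applied only to the categorical columns here, because after encoding any two columns related by a one-to-one mapping look identical, numeric ones included. The pairwise column comparison is quadratic in the number of columns, which is fine for a sketch but may need hashing on wide datasets.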