Dataset cleaning
- Constant features (remove features whose value is the same in both the training and test sets; if a value is constant in training but changes in testing, it is still better to remove the feature, because the model has never seen it vary; this can happen when only a fraction of the features is supplied in the data)
- Duplicated features (completely identical columns; they add nothing and only slow down training, so keep one copy and remove the rest)
- Duplicated categorical features (columns that are identical up to a renaming of the categories; label-encode the categorical features and then compare them)
- Duplicated rows (duplicated rows with different targets could be the result of a mistake; remove those rows to score higher on the test set)
- Check for common rows in train and test sets (for test rows that also appear in the training set, the labels can be set manually from the training targets)
- Check if dataset is shuffled (if it is, a feature or the target plotted against row index would only oscillate around its mean)
- Get domain knowledge
- Check how the data is generated
- Explore individual features
- Explore feature pairs and groups
- Clean features
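
The row-level checks in the list above (duplicated rows with different targets, and rows common to the train and test sets) could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the toy `train` and `test` frames, the `target` column name, and the helper names `conflicting_duplicates` and `label_test_from_train` are hypothetical and do not come from the original notes. It also assumes the feature columns contain no missing values.

```python
import pandas as pd

def conflicting_duplicates(train, target="target"):
    """Return training rows whose features repeat with different targets."""
    feats = [c for c in train.columns if c != target]
    # Count distinct targets inside each group of identical feature rows.
    n_targets = train.groupby(feats)[target].transform("nunique")
    return train[n_targets > 1]

def label_test_from_train(train, test, target="target"):
    """Attach the training target to test rows that also occur in train."""
    feats = [c for c in test.columns if c != target]
    # Drop conflicting groups, then keep one copy of each remaining row.
    clean = train.drop(conflicting_duplicates(train, target).index)
    clean = clean.drop_duplicates(subset=feats)
    # Left merge keeps every test row; unmatched rows get a NaN target.
    return test.merge(clean[feats + [target]], on=feats, how="left")

# Toy data (hypothetical): rows 0 and 1 share features but disagree on target.
train = pd.DataFrame({"a": [1, 1, 2, 3, 3],
                      "b": [0, 0, 5, 7, 7],
                      "target": [0, 1, 1, 0, 0]})
test = pd.DataFrame({"a": [2, 3, 9], "b": [5, 7, 9]})

conflicts = conflicting_duplicates(train)
known = label_test_from_train(train, test)
print(conflicts)
print(known)
```

Test rows with a non-missing `target` in `known` are the ones that also appear in the training set, so their labels can be filled in directly instead of being predicted.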
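# Assumes traintest is a pandas DataFrame with the train and test features concatenated,
# and categorical_feats is a list of its categorical column names.
# Constant features: columns with a single unique value (axis=0 counts per column); remove them
constant_cols = traintest.columns[traintest.nunique(axis=0) == 1]
traintest = traintest.drop(columns=constant_cols)
# Duplicated features: keep the first of each set of identical columns
traintest = traintest.loc[:, ~traintest.T.duplicated()]
# Duplicated categorical features: label-encode by order of first appearance, then compare
for f in categorical_feats:
    # factorize() returns (codes, uniques); only the integer codes are needed
    traintest[f] = traintest[f].factorize()[0]
traintest = traintest.loc[:, ~traintest.T.duplicated()]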
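
For the shuffle check in the list, one possible sketch compares a rolling mean of a column, taken in row order, with the column's overall mean. The notes do not say how the check should be carried out, so the helper name `rolling_mean_spread`, the window size, and the synthetic data below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rolling_mean_spread(values, window=100):
    """Largest gap between the rolling mean and the overall mean, in units of std."""
    s = pd.Series(values, dtype="float64").reset_index(drop=True)
    rolling = s.rolling(window).mean().dropna()
    return float((rolling - s.mean()).abs().max() / s.std())

rng = np.random.default_rng(0)
ordered = np.sort(rng.normal(size=5000))  # rows sorted by value: not shuffled
shuffled = rng.permutation(ordered)       # same values in random row order

spread_ordered = rolling_mean_spread(ordered)
spread_shuffled = rolling_mean_spread(shuffled)
print(spread_ordered)   # rolling mean drifts far from the overall mean
print(spread_shuffled)  # rolling mean only oscillates around the overall mean
```

A spread that stays small for every feature and for the target suggests the rows are shuffled; a large spread hints that row order carries information worth exploring.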