Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Day #81

October 30, 2017

Dataset cleaning

Constant features (Remove features whose value stays constant in both the training and testing data. If a value is constant in training but changes in testing, it is still better to remove the feature. This can happen when only a fraction of the features is supplied in the data)
Duplicated features (Completely identical columns add no information and slow down training; remove the duplicates and keep one copy)
Duplicated categorical features (Columns can be identical apart from how their levels are named; encode the categorical features and then compare them)
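
The difference between "constant everywhere" and "constant only in training" can be checked directly. The following is a minimal sketch, assuming pandas and two frames `train` and `test` that share the same feature columns. The helper name `constant_feature_report` and the toy data are made up for illustration and are not part of the original gist.

```python
import pandas as pd

def constant_feature_report(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Flag columns that are constant in train, and those constant in train+test."""
    both = pd.concat([train, test], ignore_index=True)
    return pd.DataFrame({
        # dropna=False treats NaN as a value of its own
        "const_in_train": train.nunique(dropna=False) == 1,
        "const_in_both": both.nunique(dropna=False) == 1,
    })

# Toy data (hypothetical): "a" is constant everywhere,
# "b" is constant in train but changes in test, "c" varies.
train = pd.DataFrame({"a": [1, 1, 1], "b": [0, 0, 0], "c": [1, 2, 3]})
test = pd.DataFrame({"a": [1, 1], "b": [0, 5], "c": [4, 5]})

report = constant_feature_report(train, test)

# Both kinds are candidates for removal: a model cannot learn anything
# from a feature that never varies in the training data.
to_drop = report.index[report["const_in_train"]].tolist()
```

Here `to_drop` contains both `a` and `b`, while the `const_in_both` column shows that only `a` is constant across the two sets.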

Other things to check

Duplicated rows (Duplicated rows with different targets could be the result of a mistake; removing them can help the score on the test set)
Check for common rows in train and test sets (For test rows that also appear in the training set, the labels can be set manually from the training data)
Check if dataset is shuffled (Plot the target against the row index; in a shuffled dataset the rolling mean oscillates around the overall mean)
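
The three checks above could be sketched as follows. This is only an illustration under stated assumptions: pandas is available, the target column is named `y`, the test set has no target column, and the features contain no missing values. The function names and the toy data are hypothetical.

```python
import pandas as pd

def conflicting_duplicates(train: pd.DataFrame, target: str) -> pd.DataFrame:
    """Rows whose feature values repeat but whose targets disagree."""
    feats = [c for c in train.columns if c != target]
    # Number of distinct targets within each group of identical feature rows
    n_targets = train.groupby(feats)[target].transform("nunique")
    return train[n_targets > 1]

def common_rows(train: pd.DataFrame, test: pd.DataFrame, target: str) -> pd.DataFrame:
    """Test rows that also appear in train, with the train target attached."""
    feats = [c for c in train.columns if c != target]
    known = train.drop_duplicates(subset=feats)[feats + [target]]
    # reset_index keeps the original test row label in an "index" column
    return test[feats].reset_index().merge(known, on=feats, how="inner")

def rolling_target_mean(train: pd.DataFrame, target: str, window: int) -> pd.Series:
    """Rolling mean of the target in row order; plot it against the row index."""
    return train[target].rolling(window).mean()

# Toy data (hypothetical)
train = pd.DataFrame({"x1": [1, 1, 2, 3], "x2": [0, 0, 5, 7], "y": [0, 1, 1, 0]})
test = pd.DataFrame({"x1": [2, 9], "x2": [5, 9]})

conflicts = conflicting_duplicates(train, "y")  # rows 0 and 1 share features, differ in y
overlap = common_rows(train, test, "y")         # test row 0 matches a training row
trend = rolling_target_mean(train, "y", window=2)
```

On real data a much larger `window` would be used. If the plotted rolling mean shows a clear trend or steps instead of oscillating around the overall mean, the rows are probably not shuffled.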

EDA Checklist

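	# Constant features: a column with a single unique value carries no information, so drop it
	# (axis=0 counts unique values per column; axis=1 would count per row)
	constant_cols = traintest.columns[traintest.nunique(axis=0) == 1]
	traintest = traintest.drop(columns=constant_cols)

	# Duplicated features: keep only the first of each group of identical columns
	traintest = traintest.T.drop_duplicates().T

	# Duplicated categorical features: label-encode each column, then compare
	for f in categorical_feats:
		traintest[f] = traintest[f].factorize()[0]  # factorize() returns (codes, uniques)
	traintest = traintest.T.drop_duplicates().T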


Happy Learning and Coding!!!
