"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 29, 2017

Day #79 - Exploratory Data Analysis (EDA)

EDA
  • Look at the data and understand it
  • A thorough understanding of the data is required to build accurate models
  • Generate hypotheses / apply intuition
  • Top solutions use advanced and aggressive modelling
  • Find insights and magic features; start with EDA before hardcore modeling
Visualization
  • Identify Patterns (Visualization to idea)
  • Use patterns to find better models (Idea to visualization, Hypothesis testing)
EDA Steps
  • Acquire domain knowledge (use Google and Wikipedia to understand the data)
  • Check that the data is intuitive (validate the values against the acquired domain knowledge; manually correct errors, or mark incorrect rows and label them so the model can leverage it)
  • Understand how the data was generated (Were the training set and test set generated by the same algorithm? You need to know the underlying data generation process; visualize training set and test set plots)
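
One way to start checking whether the training and test sets come from the same process is to put their summary statistics side by side. The sketch below is only an illustration: the toy data, the column name `feature` and the shifted mean are assumptions made up for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the test set is deliberately shifted by 0.5.
rng = np.random.default_rng(0)
train = pd.DataFrame({"feature": rng.normal(loc=0.0, scale=1.0, size=1000)})
test = pd.DataFrame({"feature": rng.normal(loc=0.5, scale=1.0, size=1000)})

# Summary statistics of the same column in both sets, side by side.
summary = pd.concat(
    [train["feature"].describe(), test["feature"].describe()],
    axis=1,
    keys=["train", "test"],
)
print(summary)

# A clear gap between the means hints that the sets were generated differently.
mean_gap = abs(summary.loc["mean", "train"] - summary.loc["mean", "test"])
print(f"absolute mean gap: {mean_gap:.3f}")
```

Plotting histograms of the same column for both sets would be the visual version of this check.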
Exploring Anonymized and Encrypted Data
Anonymized Data
  • Data is replaced with encrypted text (this will not impact the model, though)
  • Columns have no meaningful names
  • Find the unique values of a feature, sort them, and compute the differences
  • Look at the distance between consecutive sorted values and the pattern in it
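
The sort-and-difference trick can be sketched as follows. This is a minimal example under stated assumptions: the column `f0` is hypothetical, and it is built by scaling and shifting small integers so that there is a pattern to find. Real anonymized data may not be this clean.

```python
import numpy as np
import pandas as pd

# Hypothetical anonymized column: integers that were scaled by 0.25 and shifted by 7.
original = np.array([3, 1, 4, 1, 5, 9, 2, 6])
x = pd.Series(original * 0.25 + 7.0, name="f0")

# Sort the unique values and take the gaps between neighbours.
unique_sorted = np.sort(x.unique())
gaps = np.diff(unique_sorted)
print(unique_sorted)
print(gaps)

# The smallest gap is a candidate for the scaling step.
step = gaps.min()

# Removing the offset and dividing by the step gives candidate
# original values (correct only up to a constant shift).
recovered = ((x - x.min()) / step).round().astype(int)
print(recovered.tolist())
```

If the gaps are all multiples of one number, the feature was probably an integer or a count before it was transformed.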
Explore Individual Features
  • Guess the meaning of the columns
  • Guess the type of each column (categorical, Boolean, numeric, etc.)
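
A rough way to guess column types is to combine the dtype with the number of unique values. The helper `guess_type` and its thresholds below are hypothetical heuristics chosen for illustration, and the toy DataFrame is made up. Treat the output as a first guess to verify by hand.

```python
import pandas as pd

# Hypothetical anonymized data with meaningless column names.
df = pd.DataFrame({
    "a": [0, 1, 1, 0, 1, 0],
    "b": [0.5, 1.7, 2.2, 3.9, 4.1, 5.6],
    "c": ["x", "y", "x", "z", "y", "x"],
})

def guess_type(col: pd.Series) -> str:
    """Guess a column type from its dtype and unique-value count (heuristic)."""
    n_unique = col.nunique(dropna=True)
    if n_unique <= 2:
        return "boolean"
    # Numeric dtype with mostly distinct values: likely a real numeric feature.
    if pd.api.types.is_numeric_dtype(col) and n_unique > 0.5 * len(col):
        return "numeric"
    return "categorical"

guesses = {name: guess_type(df[name]) for name in df.columns}
print(guesses)
```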
Explore Feature Relations
  • Find relations between pairs of features
  • Find groups of related features
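
For numeric features, a correlation matrix is one simple starting point for finding related pairs. In this sketch the data, the column names and the 0.9 threshold are all assumptions for illustration. Correlation only captures linear relations, so it is not the only way to find feature groups.

```python
import numpy as np
import pandas as pd

# Hypothetical data: f1 is almost a scaled copy of f0, f2 is independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=500)
df = pd.DataFrame({
    "f0": base,
    "f1": 2.0 * base + rng.normal(scale=0.1, size=500),
    "f2": rng.normal(size=500),
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))

# List each pair once if its absolute correlation exceeds the (assumed) threshold.
threshold = 0.9
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

Strongly correlated pairs that chain together are candidates for a feature group.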
Useful Python functions
  • df.dtypes
  • df.info()
  • x.value_counts()
  • x.isnull()
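
The functions listed above can be tried on a small DataFrame. The data below is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "city": ["A", "B", "A", None],
    "price": [10.0, np.nan, 12.5, 8.0],
})

# Data type of each column.
print(df.dtypes)

# Column names, non-null counts, dtypes and memory usage.
df.info()

# Frequency of each distinct value (missing values are excluded by default).
counts = df["city"].value_counts()
print(counts)

# Boolean mask of missing values; summing it counts them per column.
missing = df.isnull().sum()
print(missing)
```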
Happy Learning and Coding!!!
