"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #71 - Kaggle Best Practices


After a long pause, back to learning mode. This post covers my learnings from the Coursera course Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics of a Kaggle Competition
  • Data - text, pictures (format could be CSV, database, text file, speech, etc.), accompanied by a description of the features
  • Model - exactly what is built during the competition; it transforms data into answers. Model properties - produce the best possible predictions and be reproducible
  • Submissions - your predictions are compared against the models and predictions others submit
  • Evaluation - how good is your model; the quality of a model is defined by an evaluation function (e.g., rate of correct answers)
  • Evaluation Criteria - Accuracy, Log Loss, AUC, RMSE, MAE (a quick sketch of computing these follows below)
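A quick sketch of computing these metrics with scikit-learn; the arrays below are made-up toy values, not from any competition:

import numpy as np
from sklearn.metrics import (accuracy_score, log_loss, mean_absolute_error,
                             mean_squared_error, roc_auc_score)

y_true = np.array([0, 1, 1, 0, 1])            # hypothetical class labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)          # thresholded class predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Log Loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))

# RMSE and MAE apply to regression targets (again, made-up values)
t = np.array([3.0, 5.0, 2.5])
p = np.array([2.5, 5.0, 4.0])
print("RMSE:", np.sqrt(mean_squared_error(t, p)))
print("MAE:", mean_absolute_error(t, p))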

Guidelines for submissions
  • Analyze data
  • Fit model
  • Submit
  • See your public score (a minimal end-to-end sketch follows below)
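A minimal sketch of that loop end to end; the file names (train.csv, test.csv), the column names (Id, Target), and the model choice are assumptions for illustration, not from any specific competition:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed file and column names - adapt to the competition's data page
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = [c for c in train.columns if c not in ("Id", "Target")]

# Fit a simple baseline model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Target"])

# Write predictions in the sample-submission format and upload for a public score
submission = pd.DataFrame({"Id": test["Id"],
                           "Target": model.predict(test[features])})
submission.to_csv("submission.csv", index=False)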
Data Science Competition Platforms
  • kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • DataScience.net
  • KDD
  • ViZDoom
Session #2
Using Kaggle
  • Data format and explanations
  • Evaluation Criteria
  • Sample Submission File
  • Timelines page
Kaggle vs Real-World Competitions

Real-world machine learning problems have several stages:
  • Understand business problem
  • Problem Formulation
  • Collect Data, Mine Examples
  • Clean Data
  • Preprocess it
  • Identify Model for Task
  • Select the best model
  • Evaluate accuracy
  • Deploy model (make it available to users)
  • Monitor and retrain with new data

Kaggle
  • All data is already collected and the problem is fixed
  • Model creation and evaluation
Summary
  • Real-world problems are complicated
  • Competitions are a great way to learn
  • But Kaggle competitions don't address the questions of problem formalization, deployment, and testing
Key insights
  • Importance of understanding the data
  • Tools to use
  • Try complex solutions, advanced feature engineering, and heavy computation
Session #3
  • Linear Models - classify two sets of points with a straight line (in 2 dimensions); Logistic Regression and SVM are linear models with different loss functions. Good for sparse, high-dimensional data; a linear model splits the space into two subspaces
  • Tree-based - use decision trees as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees). A decision tree takes a divide-and-conquer approach, recursively splitting the space into subspaces; tree-based methods split the space into boxes
  • KNN - k-Nearest Neighbours; points close to each other are likely to have similar labels. Take the k nearest objects and assign the majority-vote label. Relies heavily on how distance between points is measured
  • Neural Networks - a special class of ML models; a black box that produces smooth separating curves. Play with the parameters of a simple feed-forward network. Good for images, sound, text, and speech. A toy comparison of all four families follows below
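Here is a toy comparison of the four families on synthetic 2-D data (a sketch assuming scikit-learn; the dataset and hyperparameters are illustrative, not tuned):

from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic two-class, 2-D dataset with a non-linear boundary
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear (LogReg)": LogisticRegression(),
    "Tree-based (GBDT)": GradientBoostingClassifier(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Neural net (MLP)": MLPClassifier(max_iter=1000, random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(m.score(X_te, y_te), 3))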
Session #4 - Data Preprocessing
  • Preprocess for feature engineering
  • Basic feature generation for different types of features
  • Numeric, categorical, and datetime-based features (a datetime example follows below)
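For example, a minimal sketch of datetime feature generation with pandas (the trans_date column is hypothetical):

import pandas as pd

df = pd.DataFrame({"trans_date": pd.to_datetime(
    ["2017-10-01", "2017-10-14", "2017-10-27"])})

# Derive basic calendar features from the raw timestamp
df["year"] = df["trans_date"].dt.year
df["month"] = df["trans_date"].dt.month
df["dayofweek"] = df["trans_date"].dt.dayofweek        # 0 = Monday
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)  # Sat/Sun flag
print(df)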
Features
  • Binary features (0/1)
  • Numeric features (Age, fare)
  • Categorical (Classes)
Feature Preprocessing
  • Each feature type has its own ways to be preprocessed
  • Preprocessing depends on the model you plan to use
  • Linear models cannot use a raw multi-class categorical feature directly
  • Use a one-hot encoder for these
  • A random forest can easily treat each class separately and predict each probability (see the encoding sketch below)
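A small encoding sketch with pandas; the pclass column is a hypothetical Titanic-style class feature:

import pandas as pd

df = pd.DataFrame({"pclass": [1, 3, 2, 3, 1]})

# One-hot encoding for linear models: one 0/1 column per class
one_hot = pd.get_dummies(df["pclass"], prefix="pclass")
print(one_hot)

# Tree-based models can usually work with a plain integer label encoding
df["pclass_label"] = df["pclass"].astype("category").cat.codes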

Data Types
Structured Data
  • Ordinal - ranked data (1st / 2nd / 3rd)
  • Numerical - specific numeric data
  • Continuous - e.g., petrol prices
  • Categorical - days of the week, months of the year (a pandas dtype illustration is below)
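These types map naturally onto pandas dtypes; a toy illustration (all values made up):

import pandas as pd

# Ordinal: ranked categories with a defined order
ranks = pd.Series(["1st", "2nd", "3rd", "1st"]).astype(
    pd.CategoricalDtype(["1st", "2nd", "3rd"], ordered=True))
# Categorical: labels with no inherent order
days = pd.Series(["Mon", "Tue", "Sun"]).astype("category")
# Continuous vs discrete numeric
price = pd.Series([72.4, 73.1, 71.9])  # continuous (e.g., petrol prices)
count = pd.Series([3, 7, 2])           # numerical / discrete counts

print(ranks.dtype, days.dtype, price.dtype, count.dtype)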

Happy Learning and Coding!!!
