After a long pause, back to learning mode. This post covers learnings from the Coursera course Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)
Session #1
Basics on Kaggle Competition
- Data - text, pictures, etc. (format could be CSV, database, text file, speech and so on). Accompanied by a description of the features
- Model - exactly what is built during the competition. Transforms data into answers. Model properties: produce the best possible prediction and be reproducible
- Submissions - compared against the models and predictions other participants submit
- Evaluation - how good is your model? Quality of a model is defined by an evaluation function (e.g. rate of correct answers)
- Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE
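The metrics above can all be computed with scikit-learn. A minimal sketch on made-up toy values (the numbers here are illustrative, not from any competition):

```python
# Computing the evaluation criteria listed above on toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             mean_squared_error, mean_absolute_error)

# Binary classification: true labels and predicted probabilities (made up)
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])
y_pred = (y_prob >= 0.5).astype(int)   # threshold probabilities at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Log Loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))

# Regression: true values and predictions (made up)
t = np.array([3.0, 5.0, 2.5])
p = np.array([2.5, 5.0, 4.0])
print("RMSE:", np.sqrt(mean_squared_error(t, p)))
print("MAE:", mean_absolute_error(t, p))
```

Note that Accuracy works on hard labels, while Log Loss and AUC need predicted probabilities.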
Guidelines for submissions
- Analyze data
- Fit model
- Submit
- See Public Score
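The "Submit" step usually means uploading a CSV file. A minimal sketch with pandas; the column names "Id" and "Prediction" are assumptions, since the real names come from each competition's sample submission file:

```python
# Building a Kaggle-style submission CSV with pandas.
# Column names are hypothetical; match the competition's sample file.
import pandas as pd

predictions = [0.7, 0.2, 0.9]          # model outputs (made up here)
submission = pd.DataFrame({
    "Id": [1, 2, 3],
    "Prediction": predictions,
})
submission.to_csv("submission.csv", index=False)
```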
Competition Platforms
- Kaggle
- DrivenData
- CrowdANALYTIX
- CodaLab
- DataScienceChallenge.net
- DataScience.net
- KDD
- VizDoom
Using Kaggle
- Data format and explanations
- Evaluation Criteria
- Sample Submission File
- Timelines page
Real-world machine learning problems have several stages
- Understand business problem
- Problem Formulation
- Collect Data, Mine Examples
- Clean Data
- Preprocess it
- Identify Model for Task
- Select best models
- Accuracy
- Deploy model (make it available to users)
- Monitor and retrain with new data
Kaggle
- All data is already collected and the problem is already formalized
- Focus is on model creation and evaluation
- Real-world problems are complicated
- Competitions are a great way to learn
- But Kaggle competitions don't address the questions of formalization, deployment and testing
- Importance of understanding the data
- Tools to use
- Try complex solutions, advanced feature engineering, and huge calculations
- Linear Models - classify two sets of points with a linear boundary (a line in 2 dimensions): Logistic Regression, SVM (linear models with different loss functions). Good for sparse, high-dimensional data. Linear models split the space into two subspaces
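A minimal sketch of the linear-model idea: logistic regression fitting a line that separates two clusters of 2-D points. The data is synthetic, generated just for illustration:

```python
# A linear classifier splitting 2-D space into two subspaces.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(50, 2))   # class-0 cluster around (-2, -2)
X1 = rng.normal(loc=+2.0, size=(50, 2))   # class-1 cluster around (+2, +2)
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
# The learned hyperplane w·x + b = 0 is the linear separating boundary
print("weights:", clf.coef_, "bias:", clf.intercept_)
print("train accuracy:", clf.score(X, y))
```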
- Tree-based - use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees). A decision tree takes a divide-and-conquer approach, recursively splitting the space into subspaces. Tree-based methods split the space into boxes
- KNN - k-Nearest Neighbours. Points close to each other are likely to have similar labels: find the k nearest objects and assign the label by majority vote. Relies heavily on how distance between points is measured
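The majority-vote idea in a few lines, on two tiny hand-made clusters:

```python
# k-nearest-neighbours: label a new point by majority vote of its
# 3 closest training points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0],      # class-0 cluster
              [5, 5], [5, 6], [6, 5]],     # class-1 cluster
             dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```

Swapping the default Euclidean metric for another (the `metric` parameter) can change the neighbours, and hence the vote, which is why the distance measure matters so much.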
- Neural Networks - a special class of ML models; a black box that produces smooth separating curves. Play with the parameters of simple feed-forward networks. Good for images, sound, text and speech
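A sketch of a simple feed-forward network learning a non-linear boundary that no linear model could fit (synthetic XOR-style data, sizes and seeds chosen arbitrarily):

```python
# A small feed-forward network fitting a non-linearly-separable pattern.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # class depends on quadrant

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print("train accuracy:", mlp.score(X, y))
```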
- Preprocess for feature engineering
- Basic feature generation for different types of features
- Numeric, Categorical, DateTime based features
- (0/1) - Binary Features
- Numeric features (Age, fare)
- Categorical (Classes)
- Each feature type has its own ways to be preprocessed
- Preprocessing depends on the model to be used
- Linear models cannot use categorical (class) features directly
- One-hot encoding
- Random forest can easily treat each class separately and predict each probability
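A minimal sketch of one-hot encoding a categorical column so a linear model can use it; the Titanic-style column names here are assumptions for illustration:

```python
# One-hot encoding a categorical feature with pandas.
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2, 3],          # categorical classes
                   "Fare": [71.3, 7.9, 26.0, 8.1]}) # numeric feature
encoded = pd.get_dummies(df, columns=["Pclass"], prefix="Pclass")
print(encoded.columns.tolist())
```

A tree-based model, by contrast, can split on the raw `Pclass` column directly, so this step is mainly for linear models.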
Data Types
Structured Data
- Ordinal - ranked data: 1st / 2nd / 3rd
- Numerical - specific numeric data
- Continuous - e.g. petrol prices
- Categorical - days of the week, months of the year
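Ordinal data carries rank information that plain categorical encoding would lose. A sketch of preserving it with pandas ordered categories (the rank labels are illustrative):

```python
# Encoding ordinal data so the 1st < 2nd < 3rd ordering is kept.
import pandas as pd

ranks = pd.Series(["2nd", "1st", "3rd", "1st"])
ordered = ranks.astype(pd.CategoricalDtype(["1st", "2nd", "3rd"], ordered=True))
print(ordered.cat.codes.tolist())  # → [1, 0, 2, 0]
```

The integer codes respect the rank order, so models that use magnitude (e.g. linear models, trees) see 3rd as "greater than" 1st.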