After a long pause, back to learning mode. This post covers learnings from the Coursera course Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)
Session #1
Basics on Kaggle Competition
- Data - text, pictures, etc. (format could be CSV, database, text file, speech and so on). Accompanied by a description of the features
- Model - exactly what is built during the competition. Transforms data into answers. Model properties: produce the best possible prediction and be reproducible
- Submissions - compared against the models and predictions other participants submit
- Evaluation - how good is your model? Quality of a model is defined by an evaluation function (e.g. rate of correct answers)
- Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE
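The metrics above can all be computed with scikit-learn. A minimal sketch on made-up toy values (the numbers here are illustrative, not from any competition):

```python
# Computing the evaluation criteria listed above on toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             mean_squared_error, mean_absolute_error)

# Binary classification: true labels and predicted probabilities (made up)
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])
y_pred = (y_prob >= 0.5).astype(int)   # threshold probabilities at 0.5

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Log Loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))

# Regression: true values and predictions (made up)
t = np.array([3.0, 5.0, 2.5])
p = np.array([2.5, 5.0, 4.0])
print("RMSE:", np.sqrt(mean_squared_error(t, p)))
print("MAE:", mean_absolute_error(t, p))
```

Note that Accuracy works on hard labels, while Log Loss and AUC need predicted probabilities.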
Guidelines for submissions
- Analyze data
- Fit model
- Submit
- See Public Score
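The "Submit" step usually means uploading a CSV file. A minimal sketch with pandas; the column names "Id" and "Prediction" are assumptions, since the real names come from each competition's sample submission file:

```python
# Building a Kaggle-style submission CSV with pandas.
# Column names are hypothetical; match the competition's sample file.
import pandas as pd

predictions = [0.7, 0.2, 0.9]          # model outputs (made up here)
submission = pd.DataFrame({
    "Id": [1, 2, 3],
    "Prediction": predictions,
})
submission.to_csv("submission.csv", index=False)
```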
Competition Platforms
- Kaggle
- DrivenData
- CrowdANALYTIX
- CodaLab
- DataScienceChallenge.net
- DataScience.net
- KDD
- VizDoom
Using Kaggle
- Data format and explanations
- Evaluation Criteria
- Sample Submission File
- Timelines page
Real-world machine learning problems have several stages
- Understand business problem
- Problem Formulation
- Collect Data, Mine Examples
- Clean Data
- Preprocess it
- Identify Model for Task
- Select best models
- Accuracy
- Deploy model (make it available to users)
- Monitor and retrain with new data
Kaggle
- All data is already collected and the problem is already formalized
- Focus is on model creation and evaluation
- Real-world problems are complicated
- Competitions are a great way to learn
- But Kaggle competitions don't address the questions of formalization, deployment and testing
- Importance of understanding the data
- Tools to use
- Try complex solutions, advanced feature engineering, and huge calculations
- Linear Models - classify two sets of points with a linear boundary (a line in 2 dimensions): Logistic Regression, SVM (linear models with different loss functions). Good for sparse, high-dimensional data. Linear models split the space into two subspaces
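A minimal sketch of the linear-model idea: logistic regression fitting a line that separates two clusters of 2-D points. The data is synthetic, generated just for illustration:

```python
# A linear classifier splitting 2-D space into two subspaces.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(50, 2))   # class-0 cluster around (-2, -2)
X1 = rng.normal(loc=+2.0, size=(50, 2))   # class-1 cluster around (+2, +2)
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
# The learned hyperplane w·x + b = 0 is the linear separating boundary
print("weights:", clf.coef_, "bias:", clf.intercept_)
print("train accuracy:", clf.score(X, y))
```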
- Tree-based - use a decision tree as the basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees). A decision tree takes a divide-and-conquer approach, recursively splitting the space into subspaces. Tree-based methods split the space into boxes
- KNN - k-Nearest Neighbours. Points close to each other are likely to have similar labels: find the k nearest objects and assign the label by majority vote. Relies heavily on how distance between points is measured
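The majority-vote idea in a few lines, on two tiny hand-made clusters:

```python
# k-nearest-neighbours: label a new point by majority vote of its
# 3 closest training points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0],      # class-0 cluster
              [5, 5], [5, 6], [6, 5]],     # class-1 cluster
             dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```

Swapping the default Euclidean metric for another (the `metric` parameter) can change the neighbours, and hence the vote, which is why the distance measure matters so much.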
- Neural Networks - a special class of ML models; a black box that produces smooth separating curves. Play with the parameters of simple feed-forward networks. Good for images, sound, text and speech
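A sketch of a simple feed-forward network learning a non-linear boundary that no linear model could fit (synthetic XOR-style data, sizes and seeds chosen arbitrarily):

```python
# A small feed-forward network fitting a non-linearly-separable pattern.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # class depends on quadrant

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print("train accuracy:", mlp.score(X, y))
```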
- Preprocess for feature engineering
- Basic feature generation for different types of features
- Numeric, Categorical, DateTime based features
- (0/1) - Binary Features
- Numeric features (Age, fare)
- Categorical (Classes)
- Each feature type has its own ways to be preprocessed
- Preprocessing depends on the model to be used
- Linear models cannot use categorical (class) features directly
- One-hot encoding
- Random forest can easily treat each class separately and predict each probability
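A minimal sketch of one-hot encoding a categorical column so a linear model can use it; the Titanic-style column names here are assumptions for illustration:

```python
# One-hot encoding a categorical feature with pandas.
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2, 3],          # categorical classes
                   "Fare": [71.3, 7.9, 26.0, 8.1]}) # numeric feature
encoded = pd.get_dummies(df, columns=["Pclass"], prefix="Pclass")
print(encoded.columns.tolist())
```

A tree-based model, by contrast, can split on the raw `Pclass` column directly, so this step is mainly for linear models.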
Data Types
Structured Data
- Ordinal - ranked data: 1st / 2nd / 3rd
- Numerical - specific numeric data
- Continuous - e.g. petrol prices
- Categorical - days of the week, months of the year
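Ordinal data carries rank information that plain categorical encoding would lose. A sketch of preserving it with pandas ordered categories (the rank labels are illustrative):

```python
# Encoding ordinal data so the 1st < 2nd < 3rd ordering is kept.
import pandas as pd

ranks = pd.Series(["2nd", "1st", "3rd", "1st"])
ordered = ranks.astype(pd.CategoricalDtype(["1st", "2nd", "3rd"], ordered=True))
print(ordered.cat.codes.tolist())  # → [1, 0, 2, 0]
```

The integer codes respect the rank order, so models that use magnitude (e.g. linear models, trees) see 3rd as "greater than" 1st.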