"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 29, 2017

Day #78 - Image Processing - Kaggle Lessons

  • Use a model trained on similar data
  • Train a network from scratch
  • Use a pretrained model and fine-tune it later
VGGNet16 Architecture
  • Replace the last layer with a new one of size 4
  • Retrain model
  • Benefit from model trained from similar dataset
Image Augmentation
  • Increase number of training samples
  • Image rotations
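A minimal sketch of rotation-based augmentation, assuming an image is a plain 2D list of pixel values (real pipelines would use an image library). The function names and the toy image are illustrative only.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the original image plus three rotations and one mirrored copy."""
    variants = [image]
    current = image
    for _ in range(3):  # 90, 180 and 270 degree rotations
        current = rotate90(current)
        variants.append(current)
    variants.append(flip_horizontal(image))
    return variants


# Toy 2x3 "image" with made-up pixel intensities.
image = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment(image)  # one training sample becomes five
```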
Happy Learning and Coding!!!

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

  • For Features - One Hot Encoding, Label Encoding, Frequency Encoding, Ranking, MinMaxScaler, StandardScaler
  • For Dates - Periodicity - Year, Date, Week, Time Slice - Time passed since a particular moment (before / after), Difference in Dates (Datetime_feature1 - Datetime_feature2), Boolean binary indicating date is holiday or not
  • For Text - Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal, Ngrams can help use local context, Postprocessing - TF-IDF, Use BOW for Ngrams
Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons


Bag of Words
  • Create new column for each unique word in data
  • Count occurrences in each document
  • sklearn.feature_extraction.text.CountVectorizer
  • More comparable by using Term Frequency
  • tf = 1 / x.sum(axis=1)[:,None]
  • x = x*tf
  • Inverse Document Frequency
  • idf = np.log(x.shape[0] / (x > 0).sum(0))
  • x = x*idf
  • N Grams
  • Bag of Words (Each row represents text, Each column represents unique word)
  • Classifying document
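The bag-of-words and TF-IDF steps above can be sketched without any libraries. This is a toy version that assumes whitespace tokenisation and non-empty documents; the function names and sample sentences are made up for illustration, and CountVectorizer would be the practical choice.

```python
import math


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenised for word in tokens})
    counts = [[tokens.count(word) for word in vocab] for tokens in tokenised]
    return vocab, counts


def tf_idf(counts):
    """Row-normalise counts (tf), then weight each column by log(N / df)."""
    n_docs = len(counts)
    n_words = len(counts[0])
    # Document frequency: number of documents that contain each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([row[j] / total * idf[j] for j in range(n_words)])
    return weighted


docs = ["the dog is on the table", "the cat is on the mat"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Words that appear in every document (here "the", "is", "on") get weight 0, while words unique to one document keep a positive weight.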

For N = 1, This is a sentence
Unigrams are - This, is, a, sentence

For N = 2, This is a sentence
Bigrams are - This is, is a, a sentence

For N = 3, This is a sentence
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
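A small sketch of word n-gram generation, reproducing the examples above. The helper name `ngrams` is illustrative and it simply splits on whitespace.

```python
def ngrams(text, n):
    """Return the list of word n-grams of a whitespace-tokenised text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "This is a sentence"
unigrams = ngrams(sentence, 1)  # ['This', 'is', 'a', 'sentence']
bigrams = ngrams(sentence, 2)   # ['This is', 'is a', 'a sentence']
trigrams = ngrams(sentence, 3)  # ['This is a', 'is a sentence']
```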

Text Preprocessing steps
  • Lower case
  • Lemmatization (using knowledge of vocabulary and morphological analysis of words)
  • democracy, democratic and democratization -> democracy (Lemmatization)
  • Stemming (Chops off the endings of words)
  • democracy, democratic, and democratization -> democr (Stemming)
  • Stop words (Do not contain important information)
sklearn.feature_extraction.text.CountVectorizer: the stop_words and max_df parameters handle stop words

I have done all of this in my assignment work; the code is in my GitHub repository.

For Applying Bag of words
  • Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
  • Ngrams can help use local context
  • Postprocessing - TF-IDF
  • Use BOW for Ngrams
BOW example
  • Sentence - The dog is on the table
  • Vocabulary              - are, cat, dog, is, now, on, the, table
  • BOW representation  - 0,    0,    1,    1,     0,      1,    1,    1 (1 = word present; as a count, "the" would be 2)
BOW Issue

The food was good, not bad at all
The food was bad, not good at all

Both sentences get the same BOW representation even though their meanings differ :)
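A quick check of this, as a sketch that lowercases the text and strips punctuation before counting words (the helper name `bow` is illustrative):

```python
from collections import Counter
import string


def bow(text):
    """Lowercase, strip punctuation and count word occurrences."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


a = bow("The food was good, not bad at all")
b = bow("The food was bad, not good at all")
same = a == b  # True: word order is lost, so the two sentences look identical
```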

Word to Vectors
  • Get vector representation of words and texts
  • Each word converted to vector
  • Uses nearby words
  • Different words used in the same context get similar vector representations
  • Basic operations can be applied to the vectors
  • Words - Word2Vec, GloVe, FastText
  • Sentences - Doc2Vec
  • There are pretrained models
Bag of Words
  • Very large vectors
  • Meaning of each value in vector is unknown
Word2Vec
  • Relatively small vectors
  • Values of vector can be interpreted only in some cases
  • The words with similar meaning often have similar embeddings
Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

  • Reasons for Missing Values
  • How to Engineer them effectively
  • Hidden Missing Values
  • Plot distribution of values and find from histogram
Filling missing Values
  • -999, -1 (fill with an out-of-range value) - gives tree models a separate category for missing values, but performance of linear models can suffer
  • mean, median
  • Reconstruct value
  • add isnull column
Reconstruction
  • Missing values in timeseries
  • Temperature values missing for some days of month
  • Based on increase / decrease pattern
  • Ignore missing value while calculating mean
  • Change Categories to frequencies
  • XGBoost can handle NaN
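A minimal sketch of the filling options above, assuming missing values are represented as None in a plain list. The temperature readings are made up and the function names are illustrative.

```python
def mean_ignoring_missing(values):
    """Mean of the values that are present (None marks a missing value)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def fill_missing(values, fill_value):
    """Return (filled values, isnull indicator column)."""
    filled = [fill_value if v is None else v for v in values]
    is_null = [1 if v is None else 0 for v in values]
    return filled, is_null


temperatures = [21.0, None, 23.0, None, 25.0]  # made-up daily readings
mean_temp = mean_ignoring_missing(temperatures)
filled_mean, is_null = fill_missing(temperatures, mean_temp)
filled_sentinel, _ = fill_missing(temperatures, -999)
```

Keeping the `is_null` column next to the filled values lets a model still see which rows were originally missing.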
Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime
  • Differ significantly from numeric and categorical features
  • Periodicity - Year, Date, Week
  • Time Slice - Time passed since a particular moment (before / after), Time moments in period
  • Difference in Dates (Datetime_feature1 - Datetime_feature2)
  • Special Time period (Medication every 3 days)
  • Sales Predictions (Days since last holiday, Days since weekend, Since last sales campaign)
  • Boolean binary indicating date is holiday or not
  • Sales Context Churn Prediction
  •     (Date Since user registration) - DateDiff
  •     (Date Since last purchase) - DateDiff
  •     (Date Since calling customer service) - DateDiff
  • Periodicity - Day number in week, month, season, year, second, minute, hour
  • Time Slice, Difference between dates
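A sketch of a few of these date features using only the standard library. The registration date, the holiday set and the feature names are hypothetical.

```python
from datetime import date


def date_features(d, reference, holidays):
    """Build periodicity, time-since and holiday features for one date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": d.timetuple().tm_yday,
        "days_since_reference": (d - reference).days,  # date difference
        "is_holiday": int(d in holidays),
    }


registration = date(2017, 1, 1)   # hypothetical user registration date
holidays = {date(2017, 12, 25)}   # hypothetical holiday calendar
features = date_features(date(2017, 10, 29), registration, holidays)
```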
Coordinates
  • This can be used for churn prediction (Likelihood customer will return)
  • In Real Estate Scenario for predictions on Prices
  •     (Distance from School)
  •     (Distance from Airport)
  •     (Flats around particular point)
  • Alternatively, distance from the most expensive flat
  • Centre of clusters and find distances from centre point
  • Aggregated Statistics for surrounding data
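A distance-to-landmark feature could be sketched as below. This assumes coordinates are latitude / longitude in degrees and uses the haversine great-circle formula; the flat and school coordinates are made up.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


flats = [(12.97, 77.59), (13.00, 77.60)]  # hypothetical flat locations
school = (12.98, 77.60)                   # hypothetical point of interest
distance_to_school = [haversine_km(lat, lon, *school) for lat, lon in flats]
```

The same function can be pointed at an airport, the most expensive flat, or a cluster centre to generate further features.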
Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - Based on Sort Order, Order of Appearance
  • Frequency Encoding - Based on Percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for Linear methods (Minimum is zero, Maximum is 1)
  • Difficult for Tree methods based on One Hot Encoding Approach
  • Store only Non-Zero Elements (Sparse Matrices)
  • Create combination of features and get better results
  • Concatenate strings from both columns
  • One hot encoding it, Find optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

pclass_sex ==
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
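
The table above can be reproduced with a short sketch: concatenate the two columns into one interaction feature, then one-hot encode it. The variable names are illustrative.

```python
# Rows of (pclass, sex), matching the example table.
rows = [(3, "male"), (1, "female"), (3, "female"), (1, "female")]

# Interaction feature: concatenate the values of both columns.
interaction = [f"{pclass}{sex}" for pclass, sex in rows]

# One column per possible combination, in the same order as the table.
columns = [f"{p}{s}" for p in (1, 2, 3) for s in ("male", "female")]

# One-hot encode: 1 in the column matching the row's combination, else 0.
one_hot = [[int(value == col) for col in columns] for value in interaction]
```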

Ordinal Features
  • Ordered categorical feature
  • First class expensive, second less, third least expensive
  • Drivers License Type A,B,C,D
  • Level of Education (Sorted in increasingly complex order)
  • Label Encoding, Map to numbers (Tree based)
  • Non Tree can't use effectively
Label Encoding
1. Alphabetically sorted [S,C,Q] -> [3,1,2]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize

Frequency Encoding (Depending on Percentage of Occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding will help for Linear based models (If frequency is correlated with target value then linear model will use the dependency). Preserve value distribution.
  • Categories with equal frequencies become indistinguishable; applying a rank can help deal with such ties
  • from scipy.stats import rankdata
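The three encodings can be sketched in plain Python as follows. The Embarked sample is made up to match the 0.5 / 0.3 / 0.2 frequencies above, and codes are zero-based, as LabelEncoder and pandas.factorize also produce.

```python
from collections import Counter

embarked = ["S", "C", "S", "Q", "S", "C", "S", "S", "C", "Q"]  # made-up sample


def label_encode_sorted(values):
    """Map each category to its position in alphabetical order."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}


def label_encode_appearance(values):
    """Map each category to the order in which it first appears."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping


def frequency_encode(values):
    """Map each category to its share of the rows."""
    counts = Counter(values)
    return {cat: n / len(values) for cat, n in counts.items()}


sorted_codes = label_encode_sorted(embarked)
appearance_codes = label_encode_appearance(embarked)
frequencies = frequency_encode(embarked)
encoded_column = [frequencies[v] for v in embarked]
```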
Summary
  • Ordinal is a special case of categorical feature
  • Label Encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for Tree based models
  • One-Hot encoding is used for non-tree based models
  • Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!

Day #72 - Feature Generation - Numeric Features

Feature Generation
  • Predict Apple Sales (Linear Trend)
  • Example - Add a feature indicating the week number; GBDT will use it to calculate something like the mean target value for each week
  • Created Generated Tree
Numeric Features - Preprocessing
  • Tree based Methods (Decision Tree)
  • Non Tree based Methods (NN, Linear Model, KNN)
Technique #1 - Scaling of values
  • Apply Regularization in equal amounts
  • Do proper scaling
Min Max Scaler
  • To [0,1]
  • sklearn.preprocessing.MinMaxScaler
  • X = (X-X.min())/(X.max()-X.min())
Standard Scaler
  • To mean = 0, std = 1
  • sklearn.preprocessing.StandardScaler
  • X = (X-X.mean())/X.std()
Preprocessing (scaling) should be applied to all features, not just a few, so that the initial impact of each feature on the model is roughly similar
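Both scalers can be sketched with the standard library. The sketch assumes a single feature with at least two distinct values, and the ages are made up. It uses the population standard deviation, which is what StandardScaler uses; pandas X.std() defaults to the sample standard deviation, so results differ slightly.

```python
import statistics


def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Scale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = [20, 30, 40, 50, 60]  # made-up feature values
ages_min_max = min_max_scale(ages)
ages_standard = standard_scale(ages)
```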
Preprocessing Outliers
  • Calculate lower and upper bound values
  • Rank transformation
  • Better option than Min-Max Scale
Ranking, Transformations
  • scipy.stats.rankdata
  • Log transformation  - np.log(1+x)
  • Raising to power < 1 - np.sqrt(x+2/3)
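A sketch of the outlier treatments above: clipping to bounds, a rank transform and a log transform. The `rank` helper averages tied ranks, which is also the default behaviour of scipy.stats.rankdata; the sample values and bounds are made up.

```python
import math


def clip(values, lower, upper):
    """Clamp every value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def rank(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


values = [1, 2, 2, 1000]  # 1000 is an outlier
ranked = rank(values)                     # [1.0, 2.5, 2.5, 4.0]
logged = [math.log1p(v) for v in values]  # same as log(1 + x)
clipped = clip(values, 1, 10)             # [1, 2, 2, 10]
```

After ranking, the outlier sits only one step away from its neighbour, which is why rank can work better than min-max scaling when outliers are present.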
Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)
  • Creating new features
  • Engineer using prior knowledge and logic
  • Example, adding price per square foot if the price and size of the plot are provided
Summary
  • Tree based methods don't depend on scaling
  • Non-Tree methods hugely depend on scaling
Most often used preprocessing
  • MinMaxScaler - to [0,1]
  • StandardScaler - to mean==0, std==1
  • Rank - sets spaces between sorted values to be equal
  • np.log(1+x) and np.sqrt(1+x)
 Happy Learning and Coding!!!

Day #71 - Kaggle Best Practices


After a long pause, back to learning mode. This post covers learnings from the Coursera course - Winning Kaggle Competitions (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics on Kaggle Competition
  • Data - text, pictures (Format could be csv, database, text file, speech etc). Accompanied by description of features
  • Model - Exactly what is built during the competition. Transforms data into answers, Model properties - Produce the best possible prediction and be reproducible
  • Submissions - Compare against models and predictions submitted
  • Evaluations - How good is your model, Quality of model defined by Evaluation function (Rate of correct answers)
  • Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE
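Some of these evaluation criteria can be sketched in a few lines. This is illustrative only: AUC is omitted, the function names are made up, and the log loss assumes binary labels with predicted probabilities for class 1.

```python
import math


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logistic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```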

Guidelines for submissions
  • Analyze data
  • Fit model
  • Submit
  • See Public Score
Data Science Competition Platforms
  • kaggle
  • DrivenData
  • CrowdANALYTIX
  • CodaLab
  • DataScienceChallenge.net
  • DataScience.net
  • KDD
  • VizDoom
Session #2
Using Kaggle
  • Data format and explanations
  • Evaluation Criteria
  • Sample Submission File
  • Timelines page
Kaggle vs Real world Competitions

Real - World Machine learning problems have Several Stages
  • Understand business problem
  • Problem Formulation
  • Collect Data, Mine Examples
  • Clean Data
  • Preprocess it
  • Identify Model for Task
  • Select best models
  • Accuracy
  • Deploy model (make it available to users)
  • Monitor and retrain with new data

Kaggle
  • All data collected and problems fixed
  • Model creation and evaluation
Summary
  • Real world problems are complicated
  • Competitions are a great way to learn
  • But Kaggle competitions don't address the questions of formalization, deployment and testing
Key insights
  • Importance of understanding data
  • Tools to use
  • Try (Complex solutions, Advanced feature engineering, doing huge calculations)
Session #3
  • Linear Model (Classifying two sets of points using a line in 2 dimensions) - Logistic Regression, SVM (Linear models with different loss functions), Good for sparse high-dimensional data, Linear models split the space into two subspaces
  • Tree based - Use decision tree as basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees), DT - Divide and Conquer approach to recursively split spaces into sub spaces. Tree based methods split spaces into boxes
  • KNN - Nearest Neighbours, Labels for points shown, Points close to each other are likely to have similar labels, Find the K nearest objects and label by majority vote, Relies heavily on how the distance between points is measured
  • Neural Networks - Special class of ML models, Blackbox that produces smooth separating curves, Play with parameters of simple feed forward networks. Good for images, sounds, text, speech
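The KNN idea above (find the K nearest objects and label by majority vote) could be sketched as follows, assuming Euclidean distance and a tiny made-up training set. The function name `knn_predict` is illustrative.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two made-up clusters of 2D points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(points, labels, (0.5, 0.5))  # 'a'
near_five = knn_predict(points, labels, (5.5, 5.5))    # 'b'
```

Swapping `math.dist` for another distance function changes which points count as neighbours, which is why the choice of measure matters so much for KNN.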
Session #4 - Data Preprocessing
  • Preprocess for feature engineering
  • Basic feature generation for different types of features
  • Numeric, Categorical, DateTime based features
Features
  • (0/1) - Binary Features
  • Numeric features (Age, fare)
  • Categorical (Classes)
Feature Preprocessing
  • Each feature has own ways to be preprocessed
  • Depends on model to use
  • Linear models not for two class features
  • One hot encoder
  • Random forest can easily put each class separately and predict each probability

Data Types
Structured Data
  • Ordinal - Ranks 1st / 2nd / 3rd Ordinal Data
  • Numerical - Specific Numeric Data
  • Continuous - Petrol Prices continuous data
  • Categorical - Days of Week, Months of Year

Happy Learning and Coding!!!

September 25, 2017

September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on a BI cloud solution on the Microsoft Azure platform. Some of the key components I have worked with recently are
  • Azure Data Factory
  • Azure Data Lake
  • Azure SQL Data warehouse
  • Power BI on top of Data warehouse for reporting
I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from On-premises to Cloud
Here a data management gateway is installed on the on-premises machines, and pipelines are created in Azure Data Factory to move data from on-premises to Azure Data Lake

Step #2 - Azure Data Factory
ADF provides a platform for data ingestion and for consuming high volumes of data. Setting up pipelines has some similarities to, and differences from, SSIS. The key differences are
  • Everything is JSON based
  • Setting up Connections
  • Defining input and output data formats in datasets
  • Input and Output datasets also define the storage locations
  • Defining Pipeline logic, which includes the logic, input and output datasets, and scheduling for the pipeline
  • This is fairly straightforward, but there is some learning involved with the tool and its configuration properties
Step #3 - Azure Data lake
Azure Data Lake is for storing data from both RDBMS and non-RDBMS sources. If we have to integrate data from two sources such as MSSQL and MySQL for realtime processing, we can leverage the data lake to store it and consolidate it later. The data stored in Data Lake is referenced as external tables in Azure SQL Data Warehouse

Step #4 - Web Application
All the references for data movement from Data Lake and connectivity to the Data Warehouse are managed by access control, leveraged through an Azure web app. The security aspect is well managed in the Azure infrastructure

Step #5 - Data Consolidation into SQL Datawarehouse
The external tables referenced in Datalake can be referenced, queried in TSQL format and data loaded in Azure Datawarehouse tables. This is the location of fact and dimension tables that would power our datawarehouse. This could be done by stored procedures.

Step #6 - Power BI reporting
We have completed data loading and data consolidation. The next step is Power BI. Power BI has the most powerful offering for web / mobile platforms. It is convenient and easy to use. The extended Analytics / R Support / Machine Library support also makes it suitable for running both Business Intelligence and Machine Learning solutions.

Security aspects of this architecture are well handled with Firewall and IAM access as needed. It seems very stable even though some of the components are constantly updated. This is a high level explanation of the architecture. We will look into To-do exercises in the coming weeks.

Happy Learning!!!