Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

October 29, 2017

Day #77 - Quick Summary - Kaggle Lessons - Features, Dates, Text

For Features - One Hot Encoding, Label Encoding, Frequency Encoding, Ranking, MinMaxScaler, StandardScaler
For Dates - Periodicity - Year, Date, Week, Time Slice - Time passed since particular moment (before / after), Difference in Dates (Datetime_feature1 - Datetime_feature2), Boolean binary indicating date is holiday or not
For Text - Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal, Ngrams can help use local context, Postprocessing - TF-IDF, Use BOW for Ngrams

Happy Coding and Learning!!!

Day #76 - Text Processing - Kaggle Lessons

Bag of Words

Create new column for each unique word in data
Count occurrences in each document
sklearn.feature_extraction.text.CountVectorizer
More comparable by using Term Frequency
tf = 1 / x.sum(axis=1)[:,None]
x = x*tf
Inverse Document Frequency
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf
N Grams
Bag of Words (Each row represents text, Each column represents unique word)
Classifying document
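The counting, term frequency and inverse document frequency steps above can be sketched without any libraries. This is only a minimal illustration: the three example documents are made up, and in practice sklearn's CountVectorizer would build the count matrix.

```python
import math
from collections import Counter

# Made-up example documents (an assumption for illustration only).
docs = ["the dog is on the table", "the cat is on the mat", "the dog barks"]

# One column per unique word; sorted to get a stable column order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Bag of words: raw count of each vocabulary word in each document.
counts = []
for doc in docs:
    word_counts = Counter(doc.split())
    counts.append([word_counts[word] for word in vocab])

# Term frequency: divide each row by its total so that rows sum to 1.
tf = [[c / sum(row) for c in row] for row in counts]

# Inverse document frequency: log(N / number of documents containing the word).
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF: words that occur in every document (here "the") get weight 0.
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]
```

Rare words such as "table" end up with a larger weight than common ones such as "dog", which is the point of the IDF step.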

For N = 1, This is a sentence
Unigrams are - This, is, a, sentence

For N = 2, This is a sentence
Bigrams are - This is, is a, a sentence

For N = 3, This is a sentence
Trigrams are - This is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer: ngram_range, analyzer
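
A small sketch of how word N-grams are produced from the example sentence. The helper name ngrams is my own, and it splits on whitespace only, which is simpler than what CountVectorizer does.

```python
def ngrams(text, n):
    """Return the list of word n-grams of `text`, splitting on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "This is a sentence"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```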

Text Preprocessing steps

Lower case
Lemmatization (using knowledge of vocabulary and morphological analysis of words)
democracy, democratic and democratization -> democracy (Lemmatization)
Stemming (Chops off the endings of words)
democracy, democratic, and democratization -> democr (Stemming)
Stop words (Do not contain important information)

sklearn.feature_extraction.text.CountVectorizer: the stop_words and max_df parameters can be used to drop stop words

I have done all this in my assignment work. This is there in my github code

For Applying Bag of words

Preprocessing - Lowercase, Stemming, Lemmatization, stopwords removal
Ngrams can help use local context
Postprocessing - TF-IDF
Use BOW for Ngrams

BOW example

Sentence - The dog is on the table
Vocabulary - are, cat, dog, is, now, on, the, table
BOW representation (1 if the word is present, 0 otherwise) - 0, 0, 1, 1, 0, 1, 1, 1

BOW Issue

The food was good, not bad at all

The food was bad, not good at all

Both representations are the same however the meaning varies :)
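
The issue can be checked directly. This sketch (helper names are mine) shows that the two sentences have identical bags of single words, while their bags of bigrams differ, which is why N-grams help with local context.

```python
import string
from collections import Counter


def tokens(text):
    """Lowercase, strip punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def bigrams(words):
    """Pair each word with the word that follows it."""
    return [" ".join(pair) for pair in zip(words, words[1:])]


first = tokens("The food was good, not bad at all")
second = tokens("The food was bad, not good at all")

# Word order is lost, so the unigram bags are identical.
same_unigram_bag = Counter(first) == Counter(second)

# Bigrams keep some local order ("was good" vs "was bad"), so the bags differ.
same_bigram_bag = Counter(bigrams(first)) == Counter(bigrams(second))
```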

Word to Vectors

Get vector representation of words and texts
Each word converted to vector
Uses nearby words
Different words used in the same context will have similar vector representations
Basic operations can be applied to the vectors
Words - Word2Vec, GloVe, FastText
Sentences - Doc2Vec
There are pretrained models

Bag of Words

Very large vectors
Meaning of each value in vector is unknown

Word2Vec

Relatively small vectors
Values of vector can be interpreted only in some cases
The words with similar meaning often have similar embeddings

Happy Learning, Happy Coding!!!

October 27, 2017

Day #75 - Missing Values

Reasons for Missing Values
How to Engineer them effectively
Hidden Missing Values
Plot distribution of values and find from histogram

Filling missing Values

-999, -1 (Fill with some value outside the normal range) - gives missing values their own category, which helps trees, but performance of linear models and neural networks can suffer
mean, median
Reconstruct value
add isnull column
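
A minimal sketch of two of the options above, median filling plus an isnull indicator column. The temperature readings are made up, and the indicator is computed before filling so the information about missingness is not lost.

```python
import math
from statistics import median

# Made-up temperature readings; NaN marks a missing value.
temps = [21.0, float("nan"), 23.5, 22.0, float("nan"), 24.0]

# Indicator column: record which rows were missing before filling them.
is_null = [math.isnan(t) for t in temps]

# The median is computed over the observed values only.
observed = [t for t in temps if not math.isnan(t)]
fill_value = median(observed)

filled = [fill_value if missing else t for t, missing in zip(temps, is_null)]
```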

Reconstruction

Missing values in timeseries
Temperature values missing for some days of month
Based on increase / decrease pattern
Ignore missing value while calculating mean
Change Categories to frequencies
XGBoost can handle NAN

Happy Learning and Coding!!!

Day #74 - Feature Generation - DateTime and Coordinates

DateTime

Differ significantly from numeric and categorical features
Periodicity - Year, Date, Week
Time Slice - Time passed since particular moment (before / after), Time moments in period
Difference in Dates (Datetime_feature1 - Datetime_feature2)
Special Time period (Medication every 3 days)
Sales Predictions (Days since last holiday, Days since weekend, Since last sales campaign)
Boolean binary indicating date is holiday or not
Sales Context Churn Prediction
(Date Since user registration) - DateDiff
(Date Since last purchase) - DateDiff
(Date Since calling customer service) - DateDiff
Periodicity - Day number in week, month, season, year, second, minute, hour
Time Slice, Difference between dates
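
The date features listed above can be sketched with the standard library alone. All dates, the holiday set and the feature names below are invented for the example.

```python
from datetime import date

# Hypothetical customer record and reference date.
registered = date(2017, 1, 15)
last_purchase = date(2017, 10, 27)
today = date(2017, 10, 29)
holidays = {date(2017, 12, 25)}  # assumed holiday calendar

features = {
    # Periodicity: position of the date inside a repeating period.
    "day_of_week": last_purchase.weekday(),  # Monday is 0
    "month": last_purchase.month,
    # Boolean flag for special dates.
    "is_holiday": last_purchase in holidays,
    # Time since a particular moment, as a difference of two dates.
    "days_since_registration": (today - registered).days,
    "days_since_last_purchase": (today - last_purchase).days,
}
```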

Coordinates

This can be used for churn prediction (Likelihood customer will return)
In Real Estate Scenario for predictions on Prices
(Distance from School)
(Distance from Airport)
(Flats around particular point)
Alternatively distance from the most expensive flat
Centre of clusters and find distances from centre point
Aggregated Statistics for surrounding data
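
A rough sketch of two coordinate features, distance to a point of interest and the number of flats nearby. The coordinates are made up and treated as planar (x, y) values, which is an assumption. Real latitude / longitude data would need a geographic distance instead.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two planar points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


school = (0.0, 0.0)                             # hypothetical point of interest
flats = [(3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]    # hypothetical flat locations

# Feature 1: distance from each flat to the school.
dist_to_school = [euclidean(flat, school) for flat in flats]

# Feature 2: aggregated statistic, number of other flats within a radius.
radius = 5.5
flats_nearby = [
    sum(1 for j, other in enumerate(flats) if j != i and euclidean(flat, other) <= radius)
    for i, flat in enumerate(flats)
]
```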

Happy Learning and Coding!!!

Day #73 - Feature Generation - Categorical and ordinal features

Label Encoding - Based on Sort Order, Order of Appearance
Frequency Encoding - Based on Percentage of occurrence

Categorical Features

Sex, Cabin, Embarked
One Hot Encoding
pandas.get_dummies
sklearn.preprocessing.OneHotEncoder
Works well for Linear methods (Minimum is zero, Maximum is 1)
Tree methods have difficulty using one-hot encoded features
Store only Non-Zero Elements (Sparse Matrices)
Create combination of features and get better results
Concatenate strings from both columns
One hot encoding it, Find optimal coefficient for every interaction

pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

pclass_sex ==
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
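
The interaction table above can be reproduced in a few lines. This is a plain-Python sketch and the column order is fixed by hand to match the table. With pandas, get_dummies on the concatenated column would do the same job.

```python
# The four example rows from the table above: (pclass, sex).
rows = [(3, "male"), (1, "female"), (3, "female"), (1, "female")]

# Interaction feature: concatenate the two categorical values.
pclass_sex = [f"{pclass}{sex}" for pclass, sex in rows]

# One column per possible combination, in the same order as the table.
columns = ["1male", "1female", "2male", "2female", "3male", "3female"]

# One-hot encode: 1 in the column matching the row's value, 0 elsewhere.
one_hot = [[1 if value == col else 0 for col in columns] for value in pclass_sex]
```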

Ordinal Features

Ordered categorical feature
First class expensive, second less, third least expensive
Drivers License Type A,B,C,D
Level of Education (Sorted in increasingly complex order)
Label Encoding, Map to numbers (Tree based)
Non Tree can't use effectively

Label Encoding
1. Alphabetical (sorted) [S,C,Q] -> [3,1,2]
- sklearn.preprocessing.LabelEncoder (codes start at 0, so it returns [2,0,1])

2. Order of Appearance
[S,C,Q] -> [1,2,3]
- pandas.factorize (codes start at 0, so it returns [0,1,2])
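
Both label encoding variants can be sketched in plain Python. The codes here start at 0, and the sample values are invented.

```python
values = ["S", "C", "Q", "S", "C"]  # made-up Embarked values

# Variant 1: codes follow the alphabetical order of the categories.
sorted_codes = {category: code for code, category in enumerate(sorted(set(values)))}
encoded_sorted = [sorted_codes[v] for v in values]

# Variant 2: codes follow the order in which categories first appear.
appearance_codes = {}
for v in values:
    appearance_codes.setdefault(v, len(appearance_codes))
encoded_appearance = [appearance_codes[v] for v in values]
```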

Frequency Encoding (Depending on Percentage of Occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
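
The same frequency encoding can be sketched without pandas. The ten Embarked values below are made up so that the frequencies come out as 0.5, 0.3 and 0.2.

```python
from collections import Counter

# Made-up column with 5 x S, 3 x C and 2 x Q.
embarked = ["S"] * 5 + ["C"] * 3 + ["Q"] * 2

# Share of rows that each category accounts for.
counts = Counter(embarked)
encoding = {category: count / len(embarked) for category, count in counts.items()}

# Replace every category by its frequency.
enc = [encoding[value] for value in embarked]
```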

Frequency encoding will help for Linear based models (If frequency is correlated with target value then linear model will use the dependency). Preserve value distribution.

If several categories have equal frequency they become indistinguishable, apply ranking to break the ties
from scipy.stats import rankdata

Summary

Ordinal is special case of categorical feature
Label Encoding maps categories to numbers
Frequency encoding maps categories to frequencies
Label and frequency encoding are used for Tree based models
One-Hot encoding is used for non-tree based models
Interactions of categorical features can help linear models and KNN

Happy Coding and Learning!!!

Day #72- Feature Generation - Numeric Features

Feature Generation

Predict Apple Sales (Linear Trend)
Examples - Add a feature indicating the week number, A linear model can then find the trend, GBDT will use it to calculate something like the mean target value for each week
Created Generated Tree

Numeric Features - Preprocessing

Tree based Methods (Decision Tree)
Non Tree based Methods (NN, Linear Model, KNN)

Technique #1 - Scaling of values

Apply Regularization in equal amounts
Do proper scaling

Min Max Scaler

To [0,1]
sklearn.preprocessing.MinMaxScaler
X = (X-X.min())/(X.max()-X.min())

Standard Scaler

To mean = 0, std = 1
sklearn.preprocessing.StandardScaler
X = (X-X.mean())/X.std()
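
Both scalers can be sketched on a small made-up list. One detail worth hedging: this sketch uses the population standard deviation (pstdev), which is what StandardScaler uses, while a pandas X.std() call defaults to the sample standard deviation and gives slightly different numbers.

```python
from statistics import mean, pstdev

x = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up feature values

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(x), max(x)
min_max = [(v - lo) / (hi - lo) for v in x]

# Standard scaling: subtract the mean, divide by the standard deviation.
mu, sigma = mean(x), pstdev(x)
standard = [(v - mu) / sigma for v in x]
```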

Preprocessing (Scaling) should be applied to all features, not just a few, so that the initial impact of each feature on the model is roughly similar
Preprocessing Outliers

Calculate lower and upper bound values
Rank transformation
Better option than Min-Max Scale

Ranking, Transformations

scipy.stats.rankdata
Log transformation - np.log(1+x)
Raising to power < 1 - np.sqrt(x+2/3)
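
A sketch of the rank and log transforms on made-up data with one large outlier. The average_rank helper is my own and mimics the default tie handling of scipy.stats.rankdata, where tied values share the average of their ranks.

```python
import math


def average_rank(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the group of values tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


x = [1.0, 5.0, 5.0, 1000.0]              # 1000.0 plays the outlier

ranked = average_rank(x)                 # the outlier is only one rank away
logged = [math.log1p(v) for v in x]      # same as log(1 + x)
```

After ranking, the outlier sits at 4.0 next to 2.5, so it no longer dominates the scale.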

Feature Generation (Based on Feature Knowledge, Exploratory Data Analysis)

Creating new features
Engineer using prior knowledge and logic
Example, Adding price per square feet if price and size of plot is provided

Summary

Tree based methods don't depend on scaling
Non-Tree methods hugely depend on scaling

Most often used preprocessing

MinMaxScaler - to [0,1]
StandardScaler - to mean==0, std==1
Rank - sets spaces between sorted values to be equal
np.log(1+x) and np.sqrt(1+x)

Happy Learning and Coding!!!

Day #71 - Kaggle Best Practices

After a long pause, back to learning mode. This post covers learnings from the Coursera course How to Win a Data Science Competition: Learn from Top Kagglers (https://www.coursera.org/learn/competitive-data-science)

Session #1
Basics on Kaggle Competition

Data - text, pictures (Format could be csv, database, text file, speech etc). Accompanied by description of features
Model - This is what is built during the competition. Transforms data into answers, Model properties - Produce the best possible prediction and be reproducible
Submissions - Compare against models and predictions submitted
Evaluations - How good is your model, Quality of model defined by Evaluation function (Rate of correct answers)
Evaluation Criteria - Accuracy, Logistic Loss, AUC, RMSE, MAE
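
Most of the evaluation criteria listed above are short formulas. Below is a plain-Python sketch of four of them, with AUC left out to keep it short. The function names are mine, and the eps clipping in log_loss is a common convention to avoid log(0), not something stated in the course notes.

```python
import math


def accuracy(y_true, y_pred):
    """Rate of correct answers."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary logistic loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```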

Guidelines for submissions

Analyze data
Fit model
Submit
See Public Score

Data Science Competition Platforms

kaggle
DrivenData
CrowdANALYTIX
CodaLab
DataScienceChallenge.net
DataScience.net
KDD
VizDooM

Session #2
Using Kaggle

Data format and explanations
Evaluation Criteria
Sample Submission File
Timelines page

Kaggle vs Real world Competitions

Real - World Machine learning problems have Several Stages

Understand business problem
Problem Formulation
Collect Data, Mine Examples
Clean Data
Preprocess it
Identify Model for Task
Select best models
Accuracy
Deploy model (make it available to users)
Monitor and retrain with new data

Kaggle

All data collected and problems fixed
Model creation and evaluation

Summary

Real world problems are complicated
Competitions are a great way to learn
But Kaggle competitions don't address the questions of formalization, deployment and testing

Key insights

Importance of understanding data
Tools to use
Try (Complex solutions, Advanced feature engineering, doing huge calculations)

Session #3

Linear Model (Classifying two sets of points with a straight line in 2 dimensions) - Logistic Regression, SVM (Linear models with different loss functions), Good for sparse high-dimensional data, Linear models split space into two subspaces
Tree based - Use decision tree as basic block to build more complicated models (Decision Tree, Random Forest, Gradient Boosted Decision Trees), DT - Divide and conquer approach to recursively split space into subspaces. Tree based methods split space into boxes
KNN - Nearest Neighbours, Labels are known for the training points, Points close to each other are likely to have similar labels, Find the K nearest objects and label by majority vote, Relies heavily on how the distance between points is measured
Neural Networks - Special class of ML models, Black box that produces smooth separating curves, Play with parameters of simple feed forward networks. Good for images, sounds, text, speech
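
Of the four model families, KNN is short enough to sketch in full. The training points, labels and the knn_predict name are made up, and Euclidean distance is assumed as the measure (math.dist needs Python 3.8 or newer).

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two made-up clusters of labelled 2D points.
train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]

near_a = knn_predict(train, (0.5, 0.5))
near_b = knn_predict(train, (5.5, 5.5))
```

Changing the distance function changes which neighbours are found, which is why KNN relies so heavily on how closeness is measured.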

Session #4 - Data Preprocessing

Preprocess for feature engineering
Basic feature generation for different types of features
Numeric, Categorical, DateTime based features

Features

(0/1) - Binary Features
Numeric features (Age, fare)
Categorical (Classes)

Feature Preprocessing

Each feature has own ways to be preprocessed
Depends on model to use
Linear models not for two class features
One hot encoder
Random forest can easily put each class separately and predict each probability

Data Types

Structured Data

Ordinal - Ranks 1st / 2nd / 3rd Ordinal Data
Numerical - Specific Numeric Data
Continuous - Petrol Prices continuous data
Categorical - Days of Week, Months of Year

Happy Learning and Coding!!!

September 25, 2017

Data Science Leadership Role Knowledge Requirements

Happy Leading and Learning!!!

September 01, 2017

Exploring Analytics in Microsoft Azure

I am working on the Microsoft Azure platform on a BI cloud solution. Some of the key components I worked with recently are

Azure Data Factory
Azure Data Lake
Azure SQL Data warehouse
Power BI on top of Data warehouse for reporting

I had earlier compared different stacks Microsoft / Google / Amazon.

The high level workflow for cloud based BI Solution and key components are

Step #1 - Moving Data from On-premises to Cloud

Here the data management gateway is installed on the on-premises machines. Pipelines are created in Azure Data Factory to move data from on-premises to Azure Data Lake

Step #2 - Azure Data Factory

ADF provides a platform for data ingestion that can consume high volumes of data. Setting up pipelines has some similarities and differences compared to SSIS. The key differences are

Everything is JSON based
Setting up Connections
Defining input and output data formats in datasets
Input and Output datasets also define the storage locations
Defining Pipeline logic, which includes the logic, input and output datasets, and scheduling for the pipeline
This is fairly straightforward, but there is some learning involved with the tool and its configuration properties

Step #3 - Azure Data lake

Azure Data Lake is for storing data from both RDBMS and non-RDBMS sources. If we have to integrate data from two sources such as MSSQL and MySQL for real-time processing, we can leverage the data lake to store it and consolidate it later. The data stored in the Data Lake is referenced as external tables in Azure SQL Data Warehouse

Step #4 - Web Application

All data movement from the Data Lake and connectivity to the Data Warehouse is governed by access control, managed through an Azure web app. The security aspect is well managed in the Azure infrastructure

Step #5 - Data Consolidation into SQL Datawarehouse

The external tables that point to the Data Lake can be queried with T-SQL and the data loaded into Azure SQL Data Warehouse tables. This is the location of the fact and dimension tables that power our data warehouse. The load can be done by stored procedures.

Step #6 - Power BI reporting

We have completed data loading and data consolidation. Next is Power BI. Power BI has the most powerful offering for web / mobile platforms. It is convenient and easy to use. The extended Analytics / R support / machine learning library support also makes it suitable for running both Business Intelligence and Machine Learning solutions.

Security aspects of this architecture are well handled with Firewall and IAM access as needed. It seems very stable even though some of the components are constantly updated. This is a high level explanation of the architecture. We will look into to-do exercises in the coming weeks.

Happy Learning!!!

July 19, 2017

Day #70 - Machine Learning - Deep Learning Fundamentals - Machine Learning Notes

A picture is worth 1000 words. A few examples listed in the book are very precise and clear on Machine Learning fundamentals. Below are a few of the images on Machine Learning / Deep Learning concepts

Figure #1

How Machine Learning, AI and Deep Learning are inter-related. The subset representation clearly shows the knowledge boundaries
Deep Learning frameworks allow developers to iterate quickly, making algorithms accessible to practitioners. Deep Learning frameworks help to scale machine learning code for millions of users
It is important to note that the fundamentals of Machine Learning are needed to work with Deep Learning

Figure #2

In Machine Learning, historical data is used to derive learnings / rules, which are then applied to predictions on future data
From the data we need to identify (relevant features / variables), In this process we use different techniques like PCA, Correlation techniques, Derived features to identify relevant feature attributes for model creation
From the vast amount of data we collect through enterprise applications / systems we need to identify / extract relevant data to build models and validate them. Setting up the data pipeline, training with required dataset becomes key for better / high accuracy models

Figure #3

High level perspective of Deep Learning, How the nodes are defined, weights computed
The loss for each iteration is computed by comparing the predictions with the targets and is sent back to perform weight updates. This process is called back propagation
The term Deep Learning is used because the networks are 'deep' - multiple hidden layers are involved in the computation
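
The predict, compute loss, update weights loop can be sketched with a single weight. This is plain gradient descent on a one-parameter model with made-up data, not back propagation through hidden layers, but the update rule has the same shape.

```python
# Made-up training pairs that follow y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # the single weight, starting from an arbitrary value
learning_rate = 0.05  # assumed step size

for _ in range(200):
    # Forward pass: prediction is w * x, loss is the mean squared error.
    # Backward pass: d(loss)/dw = mean of 2 * (w * x - y) * x.
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # Weight update: step against the gradient.
    w -= learning_rate * gradient

final_loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
```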

Figure #4

SVM Wide street approach, line that separates two classes
Allow non-linear decision boundaries
Each dimension represents feature
Goal of SVM - Train a model that assigns unseen objects into a particular category
Advantage - High Dimensionality, Memory Efficiency, Versatility

Machine Learning Notes