"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - based on sort order or order of appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One-Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum is 0, maximum is 1)
  • Difficult for tree methods to exploit effectively
  • Store only non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate strings from both columns
  • One-hot encode the combination; the linear model then finds an optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
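The interaction table above can be sketched in pandas (a minimal toy frame mirroring the Titanic columns, not the full dataset):

```python
import pandas as pd

# Toy frame with the same four rows as the table above
df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate the two columns into a single interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the combined feature; each observed pair gets a column
dummies = pd.get_dummies(df['pclass_sex'])
print(dummies)
```

Note that get_dummies only creates columns for combinations that actually occur (here 1female, 3female, 3male), unlike the six-column table above, which also lists unseen pairs.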

Ordinal Features
  • Ordered categorical feature
  • First class is most expensive, second less, third least expensive
  • Driver's license types A, B, C, D
  • Level of education (sorted in order of increasing complexity)
  • Label encoding: map categories to numbers (works for tree-based models)
  • Non-tree models can't use it effectively
Label Encoding
1. Alphabetically sorted [S,C,Q] -> [2,1,3]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize
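Both variants can be sketched with sklearn and pandas; note that LabelEncoder and pandas.factorize number categories from 0, not 1 as in the notes above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(['S', 'C', 'S', 'Q'])

# Alphabetical label encoding: C -> 0, Q -> 1, S -> 2
le = LabelEncoder()
print(le.fit_transform(embarked))   # array mapping S,C,S,Q to 2,0,2,1

# Order-of-appearance encoding: S -> 0, C -> 1, Q -> 2
codes, uniques = pd.factorize(embarked)
print(codes)                        # S,C,S,Q becomes 0,1,0,2
```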

Frequency Encoding (based on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding can help linear models: if frequency is correlated with the target, the linear model will exploit that dependency. It also preserves information about the value distribution.
  • If several categories share the same frequency, apply a rank to break ties
  • from scipy.stats import rankdata
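A small sketch of the tie-breaking idea on toy data (the column and values are illustrative, not from the Titanic set):

```python
import pandas as pd
from scipy.stats import rankdata

# Toy column where two categories tie in frequency (S and C are both 0.4)
s = pd.Series(['S', 'S', 'C', 'C', 'Q'])
freqs = s.value_counts(normalize=True)

# Plain frequency encoding cannot tell S and C apart:
print(s.map(freqs).tolist())

# Rank the frequencies with method='ordinal' so tied values get distinct ranks
ranks = rankdata(freqs.values, method='ordinal')
encoding = dict(zip(freqs.index, ranks))
print(s.map(encoding).tolist())
```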
Summary
  • Ordinal is a special case of a categorical feature
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for tree-based models
  • One-hot encoding is used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN

#Tip #1
import numpy as np
from scipy.stats import rankdata
#Log transformation
x = np.log(1 + x)
#Raising to power < 1
x = np.sqrt(x + 2/3)
#Tip #2
#MinMaxScaler
#To [0,1]
from sklearn.preprocessing import MinMaxScaler
X = (X - X.min())/(X.max() - X.min())
#Tip #3
#StandardScaler
#To mean = 0, std = 1
from sklearn.preprocessing import StandardScaler
X = (X - X.mean())/X.std()
#Tip #4
#One-hot encoding
from pandas import get_dummies
from sklearn.preprocessing import OneHotEncoder
#Tip #5
#Label encoding
#Alphabetically sorted [S,C,Q] -> [2,1,3]
from sklearn.preprocessing import LabelEncoder
#Tip #6
#Order of appearance
#[S,C,Q] -> [1,2,3]
#pandas.factorize
#Tip #7
#Frequency encoding (based on percentage of occurrences)
#[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
#If several categories share a frequency, apply rank to break ties
from scipy.stats import rankdata
#https://www.kaggle.com/discdiver/category-encoders-examples
#pip install category_encoders
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'color': ['red','green','blue','orange','purple','violet','red'],
                   'outcome': [1,0,1,2,3,5,7]})
#split into features (color) and target (outcome)
x = df.drop('outcome',axis=1)
y = df.drop('color',axis=1)
print(x)
le = LabelEncoder()
x_encoded = le.fit_transform(np.ravel(x))
#integer code for each categorical value
print(x_encoded)
print(type(x_encoded))
#Ordinal — convert string labels to integer values 1 through k. Ordinal.
ordinal_encoding = ce.OrdinalEncoder(cols=['color'])
x_oencoded = ordinal_encoding.fit_transform(x)
print(x_oencoded)
#OneHot — one column for each value to compare vs. all other values. Nominal, ordinal.
onehot = ce.OneHotEncoder(cols=['color'])
x_ohe = onehot.fit_transform(x)
print(x_ohe)
x_ohe1 = onehot.fit_transform(x,y)
print(x_ohe1)
#Binary — convert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
be = ce.BinaryEncoder(cols=['color'])
xbe = be.fit_transform(x)
print(xbe)
#BaseN
#BaseN — Ordinal, Binary, or higher encoding. Nominal, ordinal. Doesn’t add much functionality. Probably avoid.
baseencoding = ce.BaseNEncoder()
#baseencoding = ce.BaseNEncoder(base=3)
xbe = baseencoding.fit_transform(x)
print(xbe)
#Hashing — Like OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.
he = ce.HashingEncoder(cols=['color'])
#he = ce.HashingEncoder(hash_method='md5')
xhe = he.fit_transform(x)
print(xhe)
#Postal code - convert the postal codes into latitude and longitude
#Colors - multivalent categorical: one or more values from standard colors "white", "yellow", "green", etc.
#Color - convert the textual values into numeric RGB values
#Count - the number of occurrences - discrete
#Time - cyclical numbers with a temporal component - continuous
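The cyclical-time idea can be sketched with a sin/cos transform, a common approach for temporal features (the hour column here is illustrative, not from the post's data):

```python
import numpy as np
import pandas as pd

# Encode hour-of-day cyclically so 23:00 and 00:00 land close together
# in feature space, which a plain 0-23 integer encoding would not do
hours = pd.Series([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
print(list(zip(hour_sin.round(3), hour_cos.round(3))))
```

Hour 0 maps to (0, 1) and hour 12 to (0, -1); hour 23 ends up next to hour 0 on the unit circle.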
Happy Coding and Learning!!!
