"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 27, 2017

Day #73 - Feature Generation - Categorical and ordinal features

  • Label Encoding - based on sort order or order of appearance
  • Frequency Encoding - based on percentage of occurrence
Categorical Features
  • Sex, Cabin, Embarked
  • One-Hot Encoding
  • pandas.get_dummies
  • sklearn.preprocessing.OneHotEncoder
  • Works well for linear methods (minimum is 0, maximum is 1)
  • Difficult for tree methods to exploit effectively
  • Store only non-zero elements (sparse matrices)
  • Create combinations of features to get better results
  • Concatenate strings from both columns
  • One-hot encode the combination; the linear model then finds an optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female

One-hot encoded pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
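The interaction table above can be sketched in pandas (a minimal toy frame mirroring the Titanic columns, not the full dataset):

```python
import pandas as pd

# Toy frame with the same four rows as the table above
df = pd.DataFrame({'pclass': [3, 1, 3, 1],
                   'sex': ['male', 'female', 'female', 'female']})

# Concatenate the two columns into a single interaction feature
df['pclass_sex'] = df['pclass'].astype(str) + df['sex']

# One-hot encode the combined feature; each observed pair gets a column
dummies = pd.get_dummies(df['pclass_sex'])
print(dummies)
```

Note that get_dummies only creates columns for combinations that actually occur (here 1female, 3female, 3male), unlike the six-column table above, which also lists unseen pairs.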

Ordinal Features
  • Ordered categorical feature
  • First class is most expensive, second less, third least expensive
  • Driver's license types A, B, C, D
  • Level of education (sorted in order of increasing complexity)
  • Label encoding: map categories to numbers (works for tree-based models)
  • Non-tree models can't use it effectively
Label Encoding
1. Alphabetically sorted [S,C,Q] -> [2,1,3]
 - sklearn.preprocessing.LabelEncoder

2. Order of Appearance
[S,C,Q] -> [1,2,3]
 - pandas.factorize
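Both variants can be sketched with sklearn and pandas; note that LabelEncoder and pandas.factorize number categories from 0, not 1 as in the notes above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(['S', 'C', 'S', 'Q'])

# Alphabetical label encoding: C -> 0, Q -> 1, S -> 2
le = LabelEncoder()
print(le.fit_transform(embarked))   # array mapping S,C,S,Q to 2,0,2,1

# Order-of-appearance encoding: S -> 0, C -> 1, Q -> 2
codes, uniques = pd.factorize(embarked)
print(codes)                        # S,C,S,Q becomes 0,1,0,2
```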

Frequency Encoding (based on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)

Frequency encoding can help linear models: if frequency is correlated with the target, the linear model will exploit that dependency. It also preserves information about the value distribution.
  • If several categories share the same frequency, apply a rank to break ties
  • from scipy.stats import rankdata
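A small sketch of the tie-breaking idea on toy data (the column and values are illustrative, not from the Titanic set):

```python
import pandas as pd
from scipy.stats import rankdata

# Toy column where two categories tie in frequency (S and C are both 0.4)
s = pd.Series(['S', 'S', 'C', 'C', 'Q'])
freqs = s.value_counts(normalize=True)

# Plain frequency encoding cannot tell S and C apart:
print(s.map(freqs).tolist())

# Rank the frequencies with method='ordinal' so tied values get distinct ranks
ranks = rankdata(freqs.values, method='ordinal')
encoding = dict(zip(freqs.index, ranks))
print(s.map(encoding).tolist())
```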
Summary
  • Ordinal is a special case of a categorical feature
  • Label encoding maps categories to numbers
  • Frequency encoding maps categories to frequencies
  • Label and frequency encoding are used for tree-based models
  • One-hot encoding is used for non-tree-based models
  • Interactions of categorical features can help linear models and KNN

#Tip #1
import numpy as np
from scipy.stats import rankdata
#Log transformation
x = np.log(1 + x)
#Raising to power < 1
x = np.sqrt(x + 2/3)
#Tip #2
#MinMaxScaler
#To [0,1]
from sklearn.preprocessing import MinMaxScaler
X = (X - X.min())/(X.max() - X.min())
#Tip #3
#StandardScaler
#To mean = 0, std = 1
from sklearn.preprocessing import StandardScaler
X = (X - X.mean())/X.std()
#Tip #4
#One-hot encoding
from pandas import get_dummies
from sklearn.preprocessing import OneHotEncoder
#Tip #5
#Label encoding
#Alphabetically sorted [S,C,Q] -> [2,1,3]
from sklearn.preprocessing import LabelEncoder
#Tip #6
#Order of appearance
#[S,C,Q] -> [1,2,3]
#pandas.factorize
#Tip #7
#Frequency encoding (based on percentage of occurrences)
#[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
#If several categories share a frequency, apply rank to break ties
from scipy.stats import rankdata
#https://www.kaggle.com/discdiver/category-encoders-examples
#pip install category_encoders
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'color': ['red','green','blue','orange','purple','violet','red'],
                   'outcome': [1,0,1,2,3,5,7]})
#split into features (color) and target (outcome)
x = df.drop('outcome',axis=1)
y = df.drop('color',axis=1)
print(x)
le = LabelEncoder()
x_encoded = le.fit_transform(np.ravel(x))
#integer code for each categorical value
print(x_encoded)
print(type(x_encoded))
#Ordinal — convert string labels to integer values 1 through k. Ordinal.
ordinal_encoding = ce.OrdinalEncoder(cols=['color'])
x_oencoded = ordinal_encoding.fit_transform(x)
print(x_oencoded)
#OneHot — one column for each value to compare vs. all other values. Nominal, ordinal.
onehot = ce.OneHotEncoder(cols=['color'])
x_ohe = onehot.fit_transform(x)
print(x_ohe)
x_ohe1 = onehot.fit_transform(x,y)
print(x_ohe1)
#Binary — convert each integer to binary digits. Each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
be = ce.BinaryEncoder(cols=['color'])
xbe = be.fit_transform(x)
print(xbe)
#BaseN
#BaseN — Ordinal, Binary, or higher encoding. Nominal, ordinal. Doesn’t add much functionality. Probably avoid.
baseencoding = ce.BaseNEncoder()
#baseencoding = ce.BaseNEncoder(base=3)
xbe = baseencoding.fit_transform(x)
print(xbe)
#Hashing — Like OneHot but fewer dimensions, some info loss due to collisions. Nominal, ordinal.
he = ce.HashingEncoder(cols=['color'])
#he = ce.HashingEncoder(hash_method='md5')
xhe = he.fit_transform(x)
print(xhe)
#Postal code - convert the postal codes into latitude and longitude
#Colors - multivalent categorical: one or more values from standard colors "white", "yellow", "green", etc.
#Color - convert the textual values into numeric RGB values
#Count - the number of occurrences - discrete
#Time - cyclical numbers with a temporal component - continuous
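The cyclical-time idea can be sketched with a sin/cos transform, a common approach for temporal features (the hour column here is illustrative, not from the post's data):

```python
import numpy as np
import pandas as pd

# Encode hour-of-day cyclically so 23:00 and 00:00 land close together
# in feature space, which a plain 0-23 integer encoding would not do
hours = pd.Series([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
print(list(zip(hour_sin.round(3), hour_cos.round(3))))
```

Hour 0 maps to (0, 1) and hour 12 to (0, -1); hour 23 ends up next to hour 0 on the unit circle.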
Happy Coding and Learning!!!
