- Label Encoding - Based on Sort Order, Order of Appearance
- Frequency Encoding - Based on Percentage of Occurrence
- Sex, Cabin, Embarked
- One Hot Encoding
- pandas.get_dummies
- sklearn.preprocessing.OneHotEncoder
- Works well for linear methods: values are already scaled (minimum 0, maximum 1)
- Less useful for tree-based methods, which gain little from the one-hot encoding approach
- Store only Non-Zero Elements (Sparse Matrices)
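A minimal sketch of the sparse-matrix point above, using sklearn's `OneHotEncoder` (the S/C/Q values are the Titanic `Embarked` categories used later in these notes; the toy array is an assumption):

```python
# OneHotEncoder returns a sparse matrix by default, storing only the
# non-zero elements of the encoded columns.
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

X = np.array([['S'], ['C'], ['Q'], ['S']])
enc = OneHotEncoder()            # sparse output is the default
X_ohe = enc.fit_transform(X)

print(sparse.issparse(X_ohe))    # True
print(X_ohe.toarray())           # dense view: one column per category (C, Q, S)
```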
- Create combination of features and get better results
- Concatenate strings from both columns
- One-hot encode it, so the model can find an optimal coefficient for every interaction
pclass,sex,pclass_sex
3,male,3male
1,female,1female
3,female,3female
1,female,1female
One-hot encoding of pclass_sex:
1male,1female,2male,2female,3male,3female
0,0,0,0,1,0
0,1,0,0,0,0
0,0,0,0,0,1
0,1,0,0,0,0
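The interaction example above can be sketched in pandas (a minimal toy DataFrame standing in for the Titanic data):

```python
# Build the Pclass x Sex interaction by concatenating strings from both
# columns, then one-hot encode the combined feature.
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                   'Sex': ['male', 'female', 'female', 'female']})

# Concatenate strings from both columns
df['pclass_sex'] = df['Pclass'].astype(str) + df['Sex']

# One-hot encode the interaction feature
dummies = pd.get_dummies(df['pclass_sex']).astype(int)
print(dummies)
```

Only the category combinations that actually occur get a column, so unused combinations like 2male simply do not appear.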
Ordinal Features
- Ordered categorical feature
- First class expensive, second less, third least expensive
- Drivers License Type A,B,C,D
- Level of Education (Sorted in increasingly complex order)
- Label encoding maps the ordered categories to numbers; tree-based models can use this effectively
- Non-tree models cannot exploit the ordering as effectively
1. Alphabetically sorted: [S,C,Q] -> [3,1,2] (C < Q < S)
- sklearn.preprocessing.LabelEncoder
2. Order of appearance
[S,C,Q] -> [1,2,3]
- pandas.factorize
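The two label-encoding variants side by side (note that sklearn's `LabelEncoder` assigns 0-based codes, so the alphabetical mapping comes out as 2, 0, 1 rather than 3, 1, 2):

```python
# Label encoding the Embarked values S, C, Q two ways.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(['S', 'C', 'Q', 'S'])

# 1. Alphabetical order (C, Q, S): S -> 2, C -> 0, Q -> 1
alphabetical = LabelEncoder().fit_transform(values)
print(alphabetical)        # [2 0 1 2]

# 2. Order of appearance: S -> 0, C -> 1, Q -> 2
appearance, uniques = pd.factorize(values)
print(appearance)          # [0 1 2 0]
```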
Frequency Encoding (depending on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding/len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
Frequency encoding can help linear models: if frequency is correlated with the target value, a linear model will exploit that dependency. It also preserves information about the value distribution.
- If several categories have equal frequencies, apply a rank to break the ties
- from scipy.stats import rankdata
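A short sketch of the tie-breaking step with `rankdata` (the frequency values are made up for illustration):

```python
# When two categories share the same frequency, rankdata assigns them
# tied (averaged) ranks instead of identical arbitrary codes.
import numpy as np
from scipy.stats import rankdata

freqs = np.array([0.5, 0.3, 0.2, 0.3])   # two categories tie at 0.3
ranks = rankdata(freqs)                  # ties get the average rank
print(ranks)                             # [4.  2.5 1.  2.5]
```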
- Ordinal is a special case of categorical feature
- Label Encoding maps categories to numbers
- Frequency encoding maps categories to frequencies
- Label and frequency encoding are used for Tree based models
- One-Hot encoding is used for non-tree based models
- Interactions of categorical features can help linear models and KNN
#Tip #1
from scipy.stats import rankdata
#Log transformation
x = np.log(1 + x)
#Raising to power < 1
x = np.sqrt(x + 2/3)
#Tip #2
#Min-max scaling to [0,1]
from sklearn.preprocessing import MinMaxScaler
X = (X - X.min()) / (X.max() - X.min())
#Tip #3
#Standard scaling to mean = 0, std = 1
from sklearn.preprocessing import StandardScaler
X = (X - X.mean()) / X.std()
#Tip #4
#One-hot encoding: pandas.get_dummies or sklearn.preprocessing.OneHotEncoder
#Tip #5
#Label encoding, alphabetically sorted: [S,C,Q] -> [3,1,2]
from sklearn.preprocessing import LabelEncoder
#Tip #6
#Order of appearance: [S,C,Q] -> [1,2,3], via pandas.factorize
#Tip #7
#Frequency encoding (depending on percentage of occurrences)
#[S,C,Q] -> [0.5,0.3,0.2]
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
#If several frequencies are equal, apply a rank to break ties
from scipy.stats import rankdata
#https://www.kaggle.com/discdiver/category-encoders-examples
#pip install category_encoders
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'orange', 'purple', 'violet', 'red'],
                   'outcome': [1, 0, 1, 2, 3, 5, 7]})
#Split into features and target
x = df.drop('outcome', axis=1)
y = df.drop('color', axis=1)
print(x)

le = LabelEncoder()
x_encoded = le.fit_transform(np.ravel(x))
#Integer code for each categorical value
print(x_encoded)
print(type(x_encoded))

#Ordinal - convert string labels to integer values 1 through k. Ordinal.
ordinal_encoding = ce.OrdinalEncoder(cols=['color'])
x_oencoded = ordinal_encoding.fit_transform(x)
print(x_oencoded)

#OneHot - one column for each value to compare vs. all other values. Nominal, ordinal.
onehot = ce.OneHotEncoder(cols=['color'])
x_ohe = onehot.fit_transform(x)
print(x_ohe)
x_ohe1 = onehot.fit_transform(x, y)
print(x_ohe1)

#Binary - convert each integer to binary digits; each binary digit gets one column. Some info loss but fewer dimensions. Ordinal.
be = ce.BinaryEncoder(cols=['color'])
xbe = be.fit_transform(x)
print(xbe)

#BaseN - ordinal, binary, or higher-base encoding. Nominal, ordinal. Doesn't add much functionality; probably avoid.
baseencoding = ce.BaseNEncoder()
#baseencoding = ce.BaseNEncoder(base=3)
xbe = baseencoding.fit_transform(x)
print(xbe)

#Hashing - like OneHot but fewer dimensions; some info loss due to collisions. Nominal, ordinal.
he = ce.HashingEncoder(cols=['color'])
#he = ce.HashingEncoder(hash_method='md5')
xhe = he.fit_transform(x)
print(xhe)

#Other encoding ideas:
#Postal code - convert the postal codes into latitude and longitude
#Colors - multivalent categorical: one or more values from standard colors "white", "yellow", "green", etc.
#Color - convert the textual values into numeric RGB values
#Count - the number of occurrences - discrete
#Time - cyclical numbers with a temporal component - continuous
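The last note mentions cyclical time features; a common sketch is to map the cycle onto sine and cosine components (hour-of-day here is an illustrative choice):

```python
# Sine/cosine encoding for a cyclical feature (hour of day), so that
# hour 23 and hour 0 end up close together in the encoded space.
import numpy as np

hours = np.array([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# In this 2-D representation hour 0 and hour 23 are neighbours,
# unlike in the raw 0..23 encoding where they sit at opposite ends.
print(np.round(hour_sin, 3))
print(np.round(hour_cos, 3))
```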