- Label Encoding - based on sort order or order of appearance
- Frequency Encoding - based on percentage of occurrence
- Example categorical features (Titanic): Sex, Cabin, Embarked
- One Hot Encoding
- pandas.get_dummies
- sklearn.preprocessing.OneHotEncoder
- Works well for linear methods (values are already scaled: minimum is 0, maximum is 1)
- Tree-based methods have a harder time using one-hot encoded features efficiently
- With many categories, store only the non-zero elements (sparse matrices)
- Create combinations of features to get better results
- Concatenate the strings from both columns
- One-hot encode the result; a linear model can then find an optimal coefficient for every interaction (see the code sketch after the table below)
Pclass  Sex     pclass_sex
3       male    3male
1       female  1female
3       female  3female
1       female  1female

One-hot encoding of pclass_sex:
1male  1female  2male  2female  3male  3female
0      0        0      0        1      0
0      1        0      0        0      0
0      0        0      0        0      1
0      1        0      0        0      0
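A minimal sketch of the steps above, assuming a pandas DataFrame named titanic with Pclass and Sex columns (the toy data mirrors the table; names are illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data mirroring the table above
titanic = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                        'Sex': ['male', 'female', 'female', 'female']})

# Plain one-hot encoding of a single categorical column
sex_dummies = pd.get_dummies(titanic['Sex'])

# OneHotEncoder returns a sparse matrix by default, storing only the non-zero elements
sparse_sex = OneHotEncoder().fit_transform(titanic[['Sex']])

# Feature interaction: concatenate strings from both columns, then one-hot encode
titanic['pclass_sex'] = titanic['Pclass'].astype(str) + titanic['Sex']
interactions = pd.get_dummies(titanic['pclass_sex'])
print(interactions)   # columns for this toy sample: 1female, 3female, 3male

A linear model trained on the interaction columns can fit a separate coefficient for each Pclass/Sex combination.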
Ordinal Features
- Ordered categorical feature
- Ticket class: first class is most expensive, second class less, third class the least expensive
- Driver's license type: A, B, C, D
- Level of education (sorted in order of increasing complexity)
- Label encoding: map the categories to numbers (works for tree-based models)
- Non-tree-based models can't use it effectively
1. Alphabetically sorted: [S,C,Q] -> [3,1,2] (C=1, Q=2, S=3)
- sklearn.preprocessing.LabelEncoder
2. Order of Appearance
[S,C,Q] -> [1,2,3]
- pandas.factorize
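A small sketch contrasting the two orderings on the Embarked values (a toy Series stands in for the real column):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

embarked = pd.Series(['S', 'C', 'Q', 'S'])

# 1. Alphabetically sorted categories; LabelEncoder assigns 0-based codes (C=0, Q=1, S=2)
alphabetical = LabelEncoder().fit_transform(embarked)   # -> [2, 0, 1, 2]

# 2. Order of appearance; factorize also assigns 0-based codes (S=0, C=1, Q=2)
by_appearance, uniques = pd.factorize(embarked)         # -> [0, 1, 2, 0]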
Frequency Encoding (based on percentage of occurrences)
[S,C,Q] -> [0.5,0.3,0.2]
# frequency of each Embarked category
encoding = titanic.groupby('Embarked').size()
encoding = encoding / len(titanic)
titanic['enc'] = titanic.Embarked.map(encoding)
Frequency encoding can help linear models: if the frequency is correlated with the target, a linear model will use that dependency. It also preserves information about the value distribution.
- If several categories have the same frequency, they become indistinguishable after frequency encoding; apply a rank transformation to handle such ties
- from scipy.stats import rankdata
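One way to apply the rank idea at the category level (a sketch, assuming a titanic DataFrame with an Embarked column as in the frequency-encoding snippet above; rankdata's method argument controls how ties are handled):

import pandas as pd
from scipy.stats import rankdata

# Frequency of each category, as computed above
encoding = titanic.groupby('Embarked').size() / len(titanic)

# Rank the category-level frequencies; method='ordinal' gives categories with
# equal frequency distinct (if arbitrary) ranks, so they stay distinguishable
ranks = pd.Series(rankdata(encoding, method='ordinal'), index=encoding.index)
titanic['enc_rank'] = titanic['Embarked'].map(ranks)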
- Ordinal is a special case of a categorical feature
- Label Encoding maps categories to numbers
- Frequency encoding maps categories to frequencies
- Label and frequency encoding are used for Tree based models
- One-Hot encoding is used for non-tree based models
- Interactions of categorical features can help linear models and KNN