The average of a set of numbers is called the mean. The middle value of the sorted data set is called the median.
The value that occurs most often in a list of numbers is called the mode.
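As a quick illustration, Python's standard `statistics` module can compute all three. The sample list below is made up for this sketch.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical sample data

mean_value = statistics.mean(values)      # sum divided by count: 30 / 6
median_value = statistics.median(values)  # average of the two middle values, 3 and 5
mode_value = statistics.mode(values)      # most frequent value in the list

print(mean_value, median_value, mode_value)
```

With an even number of values, the median is the average of the two middle values, which is why the list is sorted conceptually before taking the middle.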
K-modes is really only applicable to categorical data, not to sparse numerical data such as bag-of-words or tf-idf vectors.
Silhouette Method - This method measures how close each point is to the other points in its own cluster compared with the points in the nearest other cluster. Scores range from -1 to 1, and higher is better. Silhouette plots then let you choose K visually.
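The silhouette idea can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `silhouette_scores`, the toy points, and the labels are all assumptions made for the example, and it assumes at least two clusters and Euclidean distance.

```python
import math

def silhouette_scores(points, labels):
    """Return the silhouette coefficient of every point (hypothetical helper)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores

# Toy data: two well-separated groups (made up for this sketch).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels = [0, 0, 0, 1, 1, 1]

scores = silhouette_scores(points, labels)
average = sum(scores) / len(scores)
print(round(average, 3))
```

To choose K, one would run the clustering for several values of K and compare the average score for each; the K with the highest average is a reasonable candidate.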
- K-means clustering for numerical data.
- K-prototypes clustering for mixed data (numerical + categorical).
Handling categorical variables
- It depends on the kind of categorical variable. For ordinal variables, such as bad, average and good, it makes sense to use a single variable with the values 0, 1 and 2.
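A small sketch of that ordinal encoding, using a plain dictionary. The names `quality_order` and `ratings` are made up for the example.

```python
# Map each ordinal level to an integer that preserves its order.
quality_order = {"bad": 0, "average": 1, "good": 2}

ratings = ["good", "bad", "average", "good"]  # hypothetical column values

encoded = [quality_order[r] for r in ratings]
print(encoded)
```

This only makes sense when the categories have a natural order; for unordered categories such as colours, the integers would imply a ranking that does not exist.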
Algos List
- Partitioning-based algorithms: k-Prototypes, Squeezer
- Hierarchical algorithms: ROCK, agglomerative clustering with single, average, or complete linkage
- Density-based algorithms: HIERDENC, MULIC, CLIQUE
- Model-based algorithms: SVM clustering, Self-organizing maps
- Cluster with, e.g., k-means or DBSCAN, using only the continuous features
- Use k-prototypes to directly cluster the mixed data
- Use FAMD (factor analysis of mixed data) to reduce the mixed data to a set of derived continuous features which can then be clustered.
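For the k-prototypes option, the key idea is a dissimilarity that adds a squared Euclidean term for the numerical features to a weighted count of mismatches for the categorical features. The sketch below assumes that formulation; the function name `mixed_dissimilarity`, the weight `gamma`, and the sample records are illustrative, not taken from any particular library.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Distance between two mixed records (hypothetical helper).

    a_num / b_num: numerical feature values
    a_cat / b_cat: categorical feature values
    gamma: assumed weight balancing the categorical part against the numeric part
    """
    # Squared Euclidean distance over the numerical features.
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    # Simple matching: count the categorical features that differ.
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part

# Two made-up records: (height, weight) plus (colour, size).
d = mixed_dissimilarity([1.0, 2.0], ["red", "small"],
                        [2.0, 4.0], ["red", "large"],
                        gamma=0.5)
print(d)
```

In practice the numerical features are usually scaled first, otherwise the numeric term can dominate regardless of `gamma`.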
The k-means algorithm is the most widely used centre-based partitional clustering algorithm.
K-modes modifies k-means by:
- using a simple matching dissimilarity measure for categorical objects,
- replacing means of clusters by modes, and
- using a frequency-based method to update the modes.
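The three changes above can be sketched in plain Python. This is a simplified batch version written for illustration: it recomputes all modes after each full assignment pass and takes the initial modes as an argument, and the names `matching_dissimilarity`, `column_modes`, and `k_modes` as well as the toy rows are assumptions of this sketch.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value of each attribute (frequency-based mode update)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(rows, initial_modes, max_iter=10):
    modes = list(initial_modes)
    labels = []
    for _ in range(max_iter):
        # Assign every record to the nearest mode.
        new_labels = [
            min(range(len(modes)),
                key=lambda k: matching_dissimilarity(row, modes[k]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Replace each cluster's mode by the per-attribute most frequent value.
        for k in range(len(modes)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:
                modes[k] = column_modes(members)
    return labels, modes

# Made-up categorical records: (colour, size, shape).
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]

labels, modes = k_modes(rows, initial_modes=[rows[0], rows[3]])
print(labels)
print(modes)
```

As with k-means, the result depends on the initial modes, so it is common to try several initialisations and keep the one with the lowest total dissimilarity.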
Keep Exploring!!!