"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 06, 2023

K-Means vs K-Modes

The average of a set of numbers is called the mean. The middle value of the sorted data set is called the median.

The number that occurs the most in a given list of numbers is called a mode.
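As a quick illustration of the three definitions, here is a small sketch using Python's standard `statistics` module. The sample list `values` is made up for the example.

```python
from statistics import mean, median, mode

values = [2, 3, 3, 5, 7]  # made-up sample data

avg = mean(values)       # (2 + 3 + 3 + 5 + 7) / 5
middle = median(values)  # middle element of the sorted list
common = mode(values)    # value that occurs most often

print(avg, middle, common)
```

K-means summarises each cluster with the first of these (the mean), while k-modes uses the last (the mode), which is why the distinction matters for the rest of this post.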

K-modes is really only applicable to categorical data. It is not suited to sparse numerical data such as bag-of-words or tf-idf vectors.

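Silhouette Method - For each point, this method compares its average distance to the other points in its own cluster with its average distance to the points in the nearest other cluster. The resulting score lies between -1 and 1, and higher is better. Silhouette plots for different values of K then let you choose K visually.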
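The silhouette idea can be sketched in plain Python. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance, at least two clusters, and scores singleton clusters as 0. The function name `silhouette_score` and the toy points are chosen for this example only.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes a distance of 0 to the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # mixed-up grouping
print(good, bad)
```

To choose K, you would run the clustering for several values of K and prefer the one with the highest average score.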

  • K-means clustering for numerical data.
  • K-prototype clustering on mixed data. (numerical + categorical data)

Handling categorical variables

  • It depends on the kind of categorical variable. For an ordinal variable, say one with the values bad, average and good, it makes sense to use a single variable with the values 0, 1 and 2.
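A minimal sketch of that ordinal encoding, using a plain dictionary. The names `ORDER` and `ratings` are made up for the example.

```python
# Map each ordinal category to an integer that preserves its rank.
ORDER = {"bad": 0, "average": 1, "good": 2}

ratings = ["good", "bad", "average", "good"]  # made-up sample column
encoded = [ORDER[r] for r in ratings]

print(encoded)
```

The integers keep the ordering bad < average < good, so a distance-based method such as k-means treats "bad" as further from "good" than from "average".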

Algos List

  • Partitioning-based algorithms: k-Prototypes, Squeezer
  • Hierarchical algorithms: ROCK, Agglomerative single, average, and complete linkage
  • Density-based algorithms: HIERDENC, MULIC, CLIQUE
  • Model-based algorithms: SVM clustering, Self-organizing maps
  • Cluster using e.g., k-means or DBSCAN, based on only the continuous features
  • Use k-prototypes to directly cluster the mixed data
  • Use FAMD (factor analysis of mixed data) to reduce the mixed data to a set of derived continuous features which can then be clustered.
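To show how k-prototypes can handle both kinds of features at once, here is a sketch of the mixed dissimilarity it is usually described with: squared Euclidean distance on the numerical part plus a weighted count of mismatches on the categorical part. The function name `mixed_dissimilarity` is hypothetical, and the weight `gamma` has to be chosen by the user.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Dissimilarity between two records with numerical and categorical parts."""
    # Numerical part: squared Euclidean distance.
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    # Categorical part: number of attributes that do not match.
    categorical = sum(x != y for x, y in zip(a_cat, b_cat))
    # gamma balances the influence of the two parts.
    return numeric + gamma * categorical


d = mixed_dissimilarity(
    [1.0, 2.0], ["red", "small"],
    [2.0, 4.0], ["red", "large"],
    gamma=0.5,
)
print(d)  # (1 + 4) + 0.5 * 1 = 5.5
```

A larger `gamma` makes categorical mismatches count for more relative to the numerical distance.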

The k-means algorithm is the most widely used centre-based partitional clustering algorithm.

K-modes changes k-means by

  • using a simple matching dissimilarity measure for categorical objects,
  • replacing means of clusters by modes, and
  • using a frequency-based method to update the modes.
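The three changes above can be sketched in plain Python. This is a simplified batch version written for illustration: modes are recomputed once per pass rather than after every single assignment, and the initial modes are supplied by the caller. The function names and toy records are made up for the example.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category per attribute, used as the cluster's mode."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=10):
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign every record to its nearest mode.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Frequency-based update: recompute each mode from its members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster ends up empty
                modes[j] = column_modes(members)
    return labels, modes


records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels, modes = k_modes(records, [records[0], records[3]])
print(labels)
print(modes)
```

Here the first three records end up in one cluster and the last three in the other, and each mode is simply the most common category in every column of its cluster.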


Keep Exploring!!!



