February 23, 2016

K-medoids, K-means

Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Prone to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-medoids
  • Restricts each centre to be one of the data points
  • The cost function keeps the same sum-of-squares form, but the distance need not be Euclidean
  • Use your own custom distance function when the data mixes numerical and categorical variables
  • Example: a categorical variable with 25 languages becomes 24 indicator columns, and M/F/N becomes 2 columns; compute your custom distance on these. There is one column fewer than the number of categories because the all-zero combination itself represents one category
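
The encoding and custom-distance points above can be sketched roughly as follows. This is only an illustration under assumptions of my own: the helper names (`dummy_encode`, `mixed_distance`, `medoid`), the toy age/gender records, and the particular way of combining the numeric and categorical parts are all hypothetical, not a standard recipe.

```python
def dummy_encode(value, levels):
    """Encode one categorical value as len(levels) - 1 indicator columns.

    The first level is the baseline and is represented by all zeros.
    """
    return [1 if value == level else 0 for level in levels[1:]]


def mixed_distance(a, b, num_range):
    """Hypothetical custom distance for (numeric, categorical) records.

    Numeric part: absolute difference scaled by the observed range.
    Categorical part: 0 if the categories match, 1 otherwise.
    """
    numeric = abs(a[0] - b[0]) / num_range
    categorical = 0.0 if a[1] == b[1] else 1.0
    return numeric + categorical


def medoid(points, dist):
    """Return the member of `points` with the smallest total distance to all points."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


genders = ["M", "F", "N"]
encoded = [dummy_encode(g, genders) for g in genders]  # 3 levels -> 2 columns each

records = [(20, "M"), (22, "M"), (21, "F"), (60, "N")]  # toy (age, gender) data
num_range = 40  # max age - min age in the toy data
centre = medoid(records, lambda a, b: mixed_distance(a, b, num_range))
```

Because `medoid` only ever returns one of the input records, the centre is always a real data point, and the distance function can be swapped freely without changing the selection logic.
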
Distance measures for numerical variables
  • Euclidean-based distance
  • Correlation-based distance
  • Mahalanobis distance
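
A minimal sketch of the three numerical distances listed above, written in plain Python. The function names are my own, the correlation-based distance is taken here to be 1 minus the Pearson correlation (one common choice, assumed rather than stated in the notes), and the Mahalanobis version is restricted to 2-D points with a known 2x2 covariance matrix to keep the matrix inverse explicit.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated, 2 for anti-correlated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Explicit inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean one; a larger variance along one axis shrinks distances measured along that axis.
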
Distance measures for categorical variables
  • Matching coefficient and Jaccard's coefficient
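
To make the two coefficients concrete, here is a small sketch for binary (0/1) indicator vectors. The function names are hypothetical, and returning 1.0 when both vectors are all zeros is my own assumption about how to handle that edge case. A distance can be obtained from either coefficient as 1 minus its value.

```python
def matching_coefficient(x, y):
    """Share of positions where the two vectors agree (1-1 and 0-0 both count)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard_coefficient(x, y):
    """Share of 1-1 positions among positions where at least one vector has a 1.

    Unlike the matching coefficient, 0-0 agreements are ignored.
    """
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


x = [1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0]
smc = matching_coefficient(x, y)  # 4 of 5 positions agree
jac = jaccard_coefficient(x, y)   # 1 shared 1 out of 2 positions with any 1
```

The difference matters for sparse indicator columns: shared zeros inflate the matching coefficient, whereas the Jaccard coefficient only rewards shared presence.
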

