"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a Good Session on K-means clustering. Code snippet, notes in this post

Clustering - Assignment of observation into subsets that are similar in some sense

K-Means Clustering
  • Highly used algorithm
  • You need to decide on number of data groups
How it works ?
  • Start with random guess of cluster centres
  • Go through every point and compute distance between C1 and C2 (Cluster Centres)
How to measure good clustering ?
  • Intra cluster distances minimized
  • Inter cluster distances maximized
Cost function 
  • Sum of squared distances from each point to its cluster center should be minimum
  • Iteration to Iteration cost function will keep decreasing
Learning Points
  • While trying to measure, global cluster centre, local cluster centre points were identified
  • For all cluster points, sum of squares computed with global cluster centre
Mathematical Learnings 
What is sum-of-squared distances method ?
What is Euclidean Distance ?
  • Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
  • Link - ref
What is local optima ?
  • Local optima are defined as the relative best solutions within a neighbour solution set.
How to choose value of K ?

Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters


Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.

  • Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
  • Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure
  • (1-correlation coefficient)
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.

K-means
  • Prone to outliers (Squared Euclidean gives greater weight to more distant points)
  • Can't handle categorical data
  • Work with Euclidean only
K-Medoids
  • Restrict centre to data points
  • Centre picked up only from data points
  • We use same sum of squares for cost function but distance is not Euclidean distance
 Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coef and Jaquards coef
Measuring Distance between clusters
  • Single (Minimum Distance between two pairs one from each clusters)
  • Complete (Maximum  between two pairs one from each clusters)
  • Average (Average of all possible pairs)
Hierarchical Clustering
  • Compute distance in every pair of cluster
  • Manage nearest ones until number of clusters = number of clusters needed
  • Entire process can be represented as dendrogram
  • At the end of the algorithm it is plotted



Happy Learning!!!

No comments: