"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a Good Session on K-means clustering. Code snippet, notes in this post

Clustering - Assignment of observation into subsets that are similar in some sense

K-Means Clustering
  • Highly used algorithm
  • You need to decide on number of data groups
How it works ?
  • Start with random guess of cluster centres
  • Go through every point and compute distance between C1 and C2 (Cluster Centres)
How to measure good clustering ?
  • Intra cluster distances minimized
  • Inter cluster distances maximized
Cost function 
  • Sum of squared distances from each point to its cluster center should be minimum
  • Iteration to Iteration cost function will keep decreasing
Learning Points
  • While trying to measure, global cluster centre, local cluster centre points were identified
  • For all cluster points, sum of squares computed with global cluster centre
Mathematical Learnings 
What is sum-of-squared distances method ?
What is Euclidean Distance ?
  • Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
  • Link - ref
What is local optima ?
  • Local optima are defined as the relative best solutions within a neighbour solution set.
How to choose value of K ?

Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters

Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.

  • Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
  • Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure
  • (1-correlation coefficient)
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

Happy Learning!!!

No comments: