Had a Good Session on K-means clustering. Code snippet, notes in this post
Clustering - Assignment of observation into subsets that are similar in some sense
K-Means Clustering
K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel
K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.
K-means
Happy Learning!!!
Clustering - Assignment of observation into subsets that are similar in some sense
K-Means Clustering
- Highly used algorithm
- You need to decide on number of data groups
- Start with random guess of cluster centres
- Go through every point and compute distance between C1 and C2 (Cluster Centres)
- Intra cluster distances minimized
- Inter cluster distances maximized
- Sum of squared distances from each point to its cluster center should be minimum
- Iteration to Iteration cost function will keep decreasing
Learning Points
- While trying to measure, global cluster centre, local cluster centre points were identified
- For all cluster points, sum of squares computed with global cluster centre
Mathematical Learnings
What is sum-of-squared distances method ?
What is Euclidean Distance ?
- Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
- Link - ref
What is local optima ?
- Local optima are defined as the relative best solutions within a neighbour solution set.
How to choose value of K ?
Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters
Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.
- Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
- Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster
- Remove correlation before computing distances
- Mahalanobis distance measure
- (1-correlation coefficient)
K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel
K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.
K-means
- Prone to outliers (Squared Euclidean gives greater weight to more distant points)
- Can't handle categorical data
- Work with Euclidean only
- Restrict centre to data points
- Centre picked up only from data points
- We use same sum of squares for cost function but distance is not Euclidean distance
- Euclidean based distance
- Correlation based distance
- Mahalanobis distance
- Matching coef and Jaquards coef
- Single (Minimum Distance between two pairs one from each clusters)
- Complete (Maximum between two pairs one from each clusters)
- Average (Average of all possible pairs)
- Compute distance in every pair of cluster
- Manage nearest ones until number of clusters = number of clusters needed
- Entire process can be represented as dendrogram
- At the end of the algorithm it is plotted
No comments:
Post a Comment