Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Cluster Analytics

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a Good Session on K-means clustering. Code snippet, notes in this post

Clustering - Assignment of observation into subsets that are similar in some sense

K-Means Clustering

Highly used algorithm
You need to decide on number of data groups

How it works ?

Start with random guess of cluster centres
Go through every point and compute distance between C1 and C2 (Cluster Centres)

How to measure good clustering ?

Intra cluster distances minimized
Inter cluster distances maximized

Cost function

Sum of squared distances from each point to its cluster center should be minimum
Iteration to Iteration cost function will keep decreasing

Learning Points

While trying to measure, global cluster centre, local cluster centre points were identified
For all cluster points, sum of squares computed with global cluster centre

Mathematical Learnings

What is sum-of-squared distances method ?

What is Euclidean Distance ?

Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
Link - ref

What is local optima ?

Local optima are defined as the relative best solutions within a neighbour solution set.

How to choose value of K ?

Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters

Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.

Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster

Other Techniques

Remove correlation before computing distances
Mahalanobis distance measure
(1-correlation coefficient)

More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.

K-means

Prone to outliers (Squared Euclidean gives greater weight to more distant points)
Can't handle categorical data
Work with Euclidean only

K-Medoids

Restrict centre to data points
Centre picked up only from data points
We use same sum of squares for cost function but distance is not Euclidean distance

Distance measure for numerical variables

Euclidean based distance
Correlation based distance
Mahalanobis distance

Distance measure for category variables

Matching coef and Jaquards coef

Measuring Distance between clusters

Single (Minimum Distance between two pairs one from each clusters)
Complete (Maximum between two pairs one from each clusters)
Average (Average of all possible pairs)

Hierarchical Clustering

Compute distance in every pair of cluster
Manage nearest ones until number of clusters = number of clusters needed
Entire process can be represented as dendrogram
At the end of the algorithm it is plotted

Happy Learning!!!

Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database)

February 18, 2016

Cluster Analytics - Deep Dive on K Means

No comments:

Git Code Repository

About Me

What is your Expertise

Search This Blog

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts