Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Clustering

Showing posts with label Clustering. Show all posts

June 10, 2023

DBScan vs KMeans Summary

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means are two popular clustering algorithms used for unsupervised learning tasks. They have different approaches to clustering and are suitable for different types of data. Here's a comparison of the two algorithms:

Approach:

DBSCAN: DBSCAN is a density-based clustering algorithm. It groups together points that are closely packed together based on a distance measure (e.g., Euclidean distance) and a density threshold. It can find clusters of arbitrary shapes and is also able to identify noise points that do not belong to any cluster.

K-means: K-means is a centroid-based clustering algorithm. It partitions the data into K clusters by minimizing the sum of squared distances between the data points and their corresponding cluster centroids. K-means assumes that clusters are spherical and have similar sizes.

Number of clusters:

DBSCAN: The number of clusters is determined automatically by the algorithm based on the input parameters (distance threshold and minimum number of points). You don't need to specify the number of clusters beforehand.

K-means: You need to specify the number of clusters (K) beforehand. Choosing the optimal value of K can be challenging and often requires domain knowledge or using techniques like the elbow method or silhouette analysis.

Cluster shapes:

DBSCAN: DBSCAN can find clusters of arbitrary shapes, making it suitable for datasets with complex structures.

K-means: K-means assumes that clusters are spherical and have similar sizes, which may not be suitable for datasets with complex structures or clusters with different shapes and sizes.

Handling noise:

DBSCAN: DBSCAN can identify and separate noise points that do not belong to any cluster.

K-means: K-means is sensitive to noise and outliers, as they can significantly affect the position of the cluster centroids.

Scalability:

DBSCAN: DBSCAN can be slower than K-means for large datasets, especially if the distance matrix needs to be computed. However, there are optimized versions of DBSCAN (e.g., HDBSCAN) that can handle large datasets more efficiently.

K-means: K-means is generally faster and more scalable than DBSCAN, especially when using optimized implementations (e.g., MiniBatchKMeans in scikit-learn).

In summary, DBSCAN is more suitable for datasets with complex structures, arbitrary cluster shapes, and noise, while K-means is faster and more scalable but assumes spherical clusters with similar sizes. The choice between DBSCAN and K-means depends on the characteristics of the data and the specific requirements of the clustering task.

Keep Exploring!!!

March 06, 2023

K Means vs K - Modes

The average taken for a set of numbers is called a mean. The middle value in the data set is called the Median.

The number that occurs the most in a given list of numbers is called a mode.

K-modes is really only applicable for categoricial data. Not for sparse numerical data like bag-of-words or tf-idf vectors.

Silhouette Method - This method measure the distance from points in one cluster to the other clusters. Then visually you have silhouette plots that let you choose K.

K-means clustering for numerical data.
K-prototype clustering on mixed data. (numerical + categorical data)

Handle categorical variable

It depends on your categorical variable being used. For ordinal variables, say like bad,average and good, it makes sense just to use one variable and have values 0,1,2

Algos List

Partitioning-based algorithms: k-Prototypes, Squeezer
Hierarchical algorithms: ROCK, Agglomerative single, average, and complete linkage
Density-based algorithms: HIERDENC, MULIC, CLIQUE
Model-based algorithms: SVM clustering, Self-organizing maps
Cluster using e.g., k-means or DBSCAN, based on only the continuous features
Use k-prototypes to directly cluster the mixed data
Use FAMD (factor analysis of mixed data) to reduce the mixed data to a set of derived continuous features which can then be clustered.

The k-means algorithm is the most widely used centre based partitional clustering algorithm.

K modes changes

using a simple matching dissimilarity measure for categorical objects,
replacing means of clusters by modes, and
using a frequency-based method to update the modes.

Ref - Link1, Link2

Keep Exploring!!!

February 20, 2020

Day #324 - Plumbing Old 'R' Code

What I like in my Approach

R Example with
Data Normalization
Cluster Data
Build Regression on top of Clustered Data

Happy Learning!!!

October 11, 2019

Day #283 - Clustering to group similar Images

For large retail datasets, before object detection. Clustering becomes essential to focus on each cluster to take it forward. Today's post is clustering images into similar groups

Generate Feature Data based on VGG / Resnet
Cluster them using Kmeans
Result output to their respective cluster directory

Input - Mixed Set of Images

Output
Cluster 1

Cluster 2

Cluster 3

More Reads - Example (in R)

Happy Learning!!!

August 30, 2019

Day #272 - Clustering to find Data Insights

For different kinds of data, we need to pick the right columns to find the right insights. Most of them can be picked up with domain knowledge, Clustering perspective analysis.

For Sales Data, Clustering intent was to find sales insights. Data Sources were CustomerId, NumberOfOrders, TotalOrderValue. Clustered the same to find order value buckets. High Value, Medium, Low Value.
For Loss in Retail Store, Clustering insights to find loss patterns. Data Sources were SkuId, LossCount, LossValue. Clustered them to High Loss, Medium, Low Loss buckets.

This helps to address the key loss items and focus on proactive measures to prevent further loss. After a long time picked up R. Forgot execution command Ctrl+A, Ctrl+Enter. Every editor, tools, language have their own patterns, formats of coding.

This is the same as 'Recency, Frequency and Monetary value'. I came to know about this today (7/6/2020). Sometimes what you already implemented might be already done in some pattern :)

Recency, Frequency, Monetary Model with Python — and how Sephora uses it to optimize their Google and Facebook Ads
Find Your Best Customers with Customer Segmentation in Python
Introduction to Customer Segmentation in Python

Happy Learning!!!

February 23, 2016

Hierarchical Clustering

Compute distance in every pair of cluster
Merge nearest ones until number of clusters = number of clusters needed
Entire process can be represented as dendrogram
At the end of the algorithm dendogram is plotted

Measuring Distance between clusters

Single (Minimum Distance between two pairs one from each clusters)
Complete (Maximum between two pairs one from each clusters)
Average (Average of all possible pairs)

Happy Learning!!!

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a Good Session on K-means clustering. Code snippet, notes in this post

Clustering - Assignment of observation into subsets that are similar in some sense

K-Means Clustering

Highly used algorithm
You need to decide on number of data groups

How it works ?

Start with random guess of cluster centres
Go through every point and compute distance between C1 and C2 (Cluster Centres)

How to measure good clustering ?

Intra cluster distances minimized
Inter cluster distances maximized

Cost function

Sum of squared distances from each point to its cluster center should be minimum
Iteration to Iteration cost function will keep decreasing

Learning Points

While trying to measure, global cluster centre, local cluster centre points were identified
For all cluster points, sum of squares computed with global cluster centre

Mathematical Learnings

What is sum-of-squared distances method ?

What is Euclidean Distance ?

Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
Link - ref

What is local optima ?

Local optima are defined as the relative best solutions within a neighbour solution set.

How to choose value of K ?

Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters

Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.

Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster

Other Techniques

Remove correlation before computing distances
Mahalanobis distance measure
(1-correlation coefficient)

More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.

K-means

Prone to outliers (Squared Euclidean gives greater weight to more distant points)
Can't handle categorical data
Work with Euclidean only

K-Medoids

Restrict centre to data points
Centre picked up only from data points
We use same sum of squares for cost function but distance is not Euclidean distance

Distance measure for numerical variables

Euclidean based distance
Correlation based distance
Mahalanobis distance

Distance measure for category variables

Matching coef and Jaquards coef

Measuring Distance between clusters

Single (Minimum Distance between two pairs one from each clusters)
Complete (Maximum between two pairs one from each clusters)
Average (Average of all possible pairs)

Hierarchical Clustering

Compute distance in every pair of cluster
Manage nearest ones until number of clusters = number of clusters needed
Entire process can be represented as dendrogram
At the end of the algorithm it is plotted

Happy Learning!!!

November 08, 2015

K Means Clustering

I'm slowly moving in Stats with a lot of learning. This post is from my class notes

K-means clustering

Finding groups of object similar to one another
Partitioning cluster approach
Mean moves every time (Within first few iterations it will converge)
Classify a given data set through a certain number of clusters
This does not fit well for Sparse / Dense clusters

Great 5 Minute Video

Step 1 - "Figure out centric of region"
Step 2 - "Select K Data points randomly"
Step 3 - "Assign each data point to nearest centre"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Step 3,4"

More Reads - K-Means Clustering

DTW - Dynamic Time Warping Algorithm. DTW - allowing similar shapes to match even if they are out of phase in the time axis

Ref - Link

Happy Learning!!!

June 10, 2023

March 06, 2023

February 20, 2020

October 11, 2019

August 30, 2019

February 23, 2016

February 18, 2016

November 08, 2015

About Me

What is your Expertise

Search This Blog

Git Code Repository

Translate

About Me and Disclaimer

Labels

Data Science Good Reads

Cloud, Datacentre, BigData and NOSQL Blogs

SQL Links

Archecture Blog List

Programming Problems

Startup - Reads

Perl-Python-Ruby-Linux-Oracle

Management + Leadership Blogs

Research Papers & Podcasts

My Wordpress

Interesting Reads

Useful Links - C# and .NET

Java, Selenium, QTP and Test Tools Learning

Agile Testing

Reverse Logistics Reads

Biztalk Blogs

MS BI Links

Process - Learnt it :)

Usability Guidelines - Building Better Sites

.NET Test Tools and Other Interesting Reads

Review Checklist

Blog Archive

Live Traffic

Total Pageviews

Popular Posts