"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label Clustering. Show all posts
Showing posts with label Clustering. Show all posts

June 10, 2023

DBScan vs KMeans Summary

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means are two popular clustering algorithms used for unsupervised learning tasks. They have different approaches to clustering and are suitable for different types of data. Here's a comparison of the two algorithms:

Approach:

DBSCAN: DBSCAN is a density-based clustering algorithm. It groups together points that are closely packed together based on a distance measure (e.g., Euclidean distance) and a density threshold. It can find clusters of arbitrary shapes and is also able to identify noise points that do not belong to any cluster.

K-means: K-means is a centroid-based clustering algorithm. It partitions the data into K clusters by minimizing the sum of squared distances between the data points and their corresponding cluster centroids. K-means assumes that clusters are spherical and have similar sizes.

Number of clusters:

DBSCAN: The number of clusters is determined automatically by the algorithm based on the input parameters (distance threshold and minimum number of points). You don't need to specify the number of clusters beforehand.

K-means: You need to specify the number of clusters (K) beforehand. Choosing the optimal value of K can be challenging and often requires domain knowledge or using techniques like the elbow method or silhouette analysis.

Cluster shapes:

DBSCAN: DBSCAN can find clusters of arbitrary shapes, making it suitable for datasets with complex structures.

K-means: K-means assumes that clusters are spherical and have similar sizes, which may not be suitable for datasets with complex structures or clusters with different shapes and sizes.

Handling noise:

DBSCAN: DBSCAN can identify and separate noise points that do not belong to any cluster.

K-means: K-means is sensitive to noise and outliers, as they can significantly affect the position of the cluster centroids.

Scalability:

DBSCAN: DBSCAN can be slower than K-means for large datasets, especially if the distance matrix needs to be computed. However, there are optimized versions of DBSCAN (e.g., HDBSCAN) that can handle large datasets more efficiently.

K-means: K-means is generally faster and more scalable than DBSCAN, especially when using optimized implementations (e.g., MiniBatchKMeans in scikit-learn).

In summary, DBSCAN is more suitable for datasets with complex structures, arbitrary cluster shapes, and noise, while K-means is faster and more scalable but assumes spherical clusters with similar sizes. The choice between DBSCAN and K-means depends on the characteristics of the data and the specific requirements of the clustering task.




Keep Exploring!!!

March 06, 2023

K Means vs K - Modes

The average taken for a set of numbers is called a mean. The middle value in the data set is called the Median. 

The number that occurs the most in a given list of numbers is called a mode.

K-modes is really only applicable for categoricial data. Not for sparse numerical data like bag-of-words or tf-idf vectors.

Silhouette Method - This method measure the distance from points in one cluster to the other clusters. Then visually you have silhouette plots that let you choose K.

  • K-means clustering for numerical data.
  • K-prototype clustering on mixed data. (numerical + categorical data)

Handle categorical variable

  • It depends on your categorical variable being used. For ordinal variables, say like bad,average and good, it makes sense just to use one variable and have values 0,1,2 

Algos List

  • Partitioning-based algorithms: k-Prototypes, Squeezer
  • Hierarchical algorithms: ROCK, Agglomerative single, average, and complete linkage
  • Density-based algorithms: HIERDENC, MULIC, CLIQUE
  • Model-based algorithms: SVM clustering, Self-organizing maps
  • Cluster using e.g., k-means or DBSCAN, based on only the continuous features
  • Use k-prototypes to directly cluster the mixed data
  • Use FAMD (factor analysis of mixed data) to reduce the mixed data to a set of derived continuous features which can then be clustered.

The k-means algorithm is the most widely used centre based partitional clustering algorithm.

K modes changes

  • using a simple matching dissimilarity measure for categorical objects,
  • replacing means of clusters by modes, and
  • using a frequency-based method to update the modes.

Ref - Link1, Link2

Keep Exploring!!!




February 20, 2020

Day #324 - Plumbing Old 'R' Code

What I like in my Approach
  • R Example with 
  • Data Normalization
  • Cluster Data
  • Build Regression on top of Clustered Data


Happy Learning!!!

October 11, 2019

Day #283 - Clustering to group similar Images


For large retail datasets, before object detection. Clustering becomes essential to focus on each cluster to take it forward. Today's post is clustering images into similar groups

  • Generate Feature Data based on VGG / Resnet
  • Cluster them using Kmeans
  • Result output to their respective cluster directory


Input - Mixed Set of Images
Output 
Cluster 1
Cluster 2

Cluster 3
More Reads - Example (in R)

Happy Learning!!!

August 30, 2019

Day #272 - Clustering to find Data Insights

For different kinds of data, we need to pick the right columns to find the right insights. Most of them can be picked up with domain knowledge, Clustering perspective analysis.
  • For Sales Data, Clustering intent was to find sales insights. Data Sources were CustomerId, NumberOfOrders, TotalOrderValue. Clustered the same to find order value buckets. High Value, Medium, Low Value.
  • For Loss in Retail Store, Clustering insights to find loss patterns. Data Sources were SkuId, LossCount, LossValue. Clustered them to High Loss, Medium, Low Loss buckets.
This helps to address the key loss items and focus on proactive measures to prevent further loss. After a long time picked up R. Forgot execution command Ctrl+A, Ctrl+Enter. Every editor, tools, language have their own patterns, formats of coding.

This is the same as 'Recency, Frequency and Monetary value'. I came to know about this today (7/6/2020). Sometimes what you already implemented might be already done in some pattern :)

Recency, Frequency, Monetary Model with Python — and how Sephora uses it to optimize their Google and Facebook Ads
Find Your Best Customers with Customer Segmentation in Python
Introduction to Customer Segmentation in Python

Happy Learning!!!

February 23, 2016

Hierarchical Clustering


  • Compute distance in every pair of cluster
  • Merge nearest ones until number of clusters = number of clusters needed
  • Entire process can be represented as dendrogram
  • At the end of the algorithm dendogram is plotted
Measuring Distance between clusters
  • Single (Minimum Distance between two pairs one from each clusters)
  • Complete (Maximum  between two pairs one from each clusters)
  • Average (Average of all possible pairs)

Happy Learning!!!

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a Good Session on K-means clustering. Code snippet, notes in this post

Clustering - Assignment of observation into subsets that are similar in some sense

K-Means Clustering
  • Highly used algorithm
  • You need to decide on number of data groups
How it works ?
  • Start with random guess of cluster centres
  • Go through every point and compute distance between C1 and C2 (Cluster Centres)
How to measure good clustering ?
  • Intra cluster distances minimized
  • Inter cluster distances maximized
Cost function 
  • Sum of squared distances from each point to its cluster center should be minimum
  • Iteration to Iteration cost function will keep decreasing
Learning Points
  • While trying to measure, global cluster centre, local cluster centre points were identified
  • For all cluster points, sum of squares computed with global cluster centre
Mathematical Learnings 
What is sum-of-squared distances method ?
What is Euclidean Distance ?
  • Distance between two points in the plane with coordinates (x, y) and (a, b) is given by
  • Link - ref
What is local optima ?
  • Local optima are defined as the relative best solutions within a neighbour solution set.
How to choose value of K ?

Elbow method - Plot with number of clusters and compute cost function. When there is sharp decline that would denote optimum number of clusters


Elbow method - Plot for Elbow method. At centre = 3 there is a steep fall which means 3 is optimum number.

  • Hard Clustering - Object belongs to only one cluster. Element can fall in only one cluster.
  • Soft Clustering - Some object belong to different clusters. Probability of how much it would fit in that cluster
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure
  • (1-correlation coefficient)
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great Learning and lot of revisions needed to really deep dive and understand the fundamentals.

K-means
  • Prone to outliers (Squared Euclidean gives greater weight to more distant points)
  • Can't handle categorical data
  • Work with Euclidean only
K-Medoids
  • Restrict centre to data points
  • Centre picked up only from data points
  • We use same sum of squares for cost function but distance is not Euclidean distance
 Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coef and Jaquards coef
Measuring Distance between clusters
  • Single (Minimum Distance between two pairs one from each clusters)
  • Complete (Maximum  between two pairs one from each clusters)
  • Average (Average of all possible pairs)
Hierarchical Clustering
  • Compute distance in every pair of cluster
  • Manage nearest ones until number of clusters = number of clusters needed
  • Entire process can be represented as dendrogram
  • At the end of the algorithm it is plotted



Happy Learning!!!

November 08, 2015

K Means Clustering


I'm slowly moving in Stats with a lot of learning. This post is from my class notes

K-means clustering

  • Finding groups of object similar to one another
  • Partitioning cluster approach
  • Mean moves every time (Within first few iterations it will converge)
  • Classify a given data set through a certain number of clusters
  • This does not fit well for Sparse / Dense clusters

Great 5 Minute Video



Step 1 - "Figure out centric of region"
Step 2 - "Select K Data points randomly"
Step 3 - "Assign each data point to nearest centre"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Step 3,4"

More Reads - K-Means Clustering

DTW  - Dynamic Time Warping Algorithm. DTW - allowing similar shapes to match even if they are out of phase in the time axis

Ref - Link

Happy Learning!!!