K-means
- Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
- Can't handle categorical data
- Works with Euclidean distance only

K-medoids
- Restricts each cluster centre to an actual data point (the medoid)
- Uses the same sum-of-distances style of cost function, but the distance need not be Euclidean
- Use your own custom distance function when the data mix numerical and categorical variables (see the sketch after this list)
- Example: a language attribute with 25 levels becomes 24 dummy columns, and M/F/N becomes 2 columns. Each attribute needs one column less than its number of levels because the all-zero combination already encodes the remaining category
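In practice, Gower dissimilarity is a convenient way to mix numerical and categorical variables without hand-rolling dummy columns. Here is a minimal sketch using daisy() from the cluster package on a hypothetical toy data frame (the column names and values are made up for illustration):

# Gower dissimilarity for mixed numeric/categorical data (hypothetical data)
library(cluster)
people <- data.frame(
  age      = c(25, 47, 31, 52),
  language = factor(c("EN", "FR", "EN", "DE")),
  gender   = factor(c("M", "F", "N", "M"))
)
# daisy() with metric="gower" handles numeric and factor columns together
d <- daisy(people, metric = "gower")
# Feed the dissimilarity matrix straight into pam()
pam(d, k = 2, diss = TRUE)$clustering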
Distance measures for numerical variables
- Euclidean-based distance
- Correlation-based distance
- Mahalanobis distance
Distance measures for categorical variables
- Simple matching coefficient and Jaccard's coefficient
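Each of these measures is available in base R or the cluster package. A minimal sketch on hypothetical random data (the data sizes are chosen only for illustration):

# Distance measures for numerical variables (10 observations, 4 variables)
library(cluster)
num <- matrix(rnorm(40), nrow = 10)
dist(num, method = "euclidean")            # Euclidean distance between observations
as.dist(1 - cor(t(num)))                   # correlation-based distance
mahalanobis(num, colMeans(num), cov(num))  # squared Mahalanobis distance to the mean

# Distance measures for binary categorical variables
bin <- matrix(rbinom(40, 1, 0.5), nrow = 10)
dist(bin, method = "binary")               # 1 - Jaccard coefficient
# Simple matching: treat the columns as nominal factors and use Gower,
# which scores each attribute 0 on a match and 1 on a mismatch
daisy(data.frame(lapply(as.data.frame(bin), factor)), metric = "gower")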
# K-medoids (Non-Hierarchical Clustering)
library(cluster)  # provides pam()

# Read the data frame
food = read.csv("protein.csv")

# Partitioning Around Medoids: pass the data frame
# (dropping the country column) and the number of clusters
pam.result <- pam(food[,-1], 2)
pam.result$clustering
summary(pam.result)

# Same call with the Manhattan distance measure;
# diss=FALSE means food[,-1] holds raw observations,
# not a precomputed dissimilarity matrix
pam.result <- pam(food[,-1], k=2, diss=FALSE, metric="manhattan")
pam.result$clustering
summary(pam.result)

# Plot the clusters in the red-meat/white-meat plane,
# labelling each point with its country name
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat", ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country, col=pam.result$clustering+1)
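A natural follow-up, sketched here as an addition rather than part of the original walkthrough: pam() reports the average silhouette width in silinfo$avg.width, which can be used to pick the number of clusters.

# Choose k by maximising the average silhouette width over a range of k
sil <- sapply(2:6, function(k) pam(food[,-1], k)$silinfo$avg.width)
(best.k <- (2:6)[which.max(sil)])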
Happy Learning!!!