"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 23, 2016

K-medoids, K-means

Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Sensitive to outliers, since squared Euclidean distance gives greater weight to more distant points (see the sketch after this list)
  • Can't handle categorical data
  • Works with Euclidean distance only
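A minimal sketch of the outlier point on synthetic, hypothetical data: a single extreme value distorts kmeans(), while pam() keeps its centres on real observations.

set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           c(50, 50))                    #one extreme outlier
km <- kmeans(x, centers = 2)
km$centers                               #the outlier distorts the solution (drags a centroid or grabs a cluster of its own)
library(cluster)
pm <- pam(x, k = 2)
x[pm$id.med, ]                           #medoids are restricted to actual observations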
K-Medoids
  • Restricts cluster centres to actual data points: the medoid is always picked from the dataset itself
  • Uses the same sum-of-dissimilarities style of cost function, but the distance need not be Euclidean
  • Supports custom distance functions, which helps when the data mixes numerical and categorical variables (see the sketch after this list)
  • Example: a 25-level categorical variable (languages) needs only 24 dummy columns, and M/F/N needs only 2, because the all-zero combination already encodes the remaining level
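A minimal sketch of a mixed-variable distance, using daisy() with the Gower metric from the cluster package; the data frame and its column names are hypothetical, purely for illustration:

library(cluster)
#Hypothetical mixed data: one numeric column plus two factors
df <- data.frame(income   = c(30, 45, 80, 52),
                 language = factor(c("EN", "FR", "EN", "DE")),
                 gender   = factor(c("M", "F", "N", "F")))
#Gower dissimilarity handles numeric and categorical columns together
d <- daisy(df, metric = "gower")
#Pass the precomputed dissimilarity matrix to pam() with diss=TRUE
pam(d, k = 2, diss = TRUE)$clustering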
Distance measures for numerical variables (sketched below)
  • Euclidean-based distance
  • Correlation-based distance
  • Mahalanobis distance
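All three can be computed in base R; a quick sketch on random toy data:

x <- matrix(rnorm(40), nrow = 10)        #10 observations, 4 variables
dist(x, method = "euclidean")            #Euclidean distance between observations
as.dist(1 - cor(t(x)))                   #correlation-based distance between observations
mahalanobis(x, colMeans(x), cov(x))      #squared Mahalanobis distance of each observation from the centroid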
Distance measures for categorical variables (sketched below)
  • Simple matching coefficient and Jaccard's coefficient
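Both coefficients are simple to compute by hand; a sketch on two toy binary vectors:

a <- c(1, 0, 1, 1, 0, 0)
b <- c(1, 1, 1, 0, 0, 0)
mean(a == b)                             #simple matching coefficient: matches / all attributes
sum(a & b) / sum(a | b)                  #Jaccard coefficient: joint 1s / attributes with at least one 1
dist(rbind(a, b), method = "binary")     #base R equivalent: returns 1 - Jaccard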
#K-medoids (Partitioning Around Medoids, a non-hierarchical clustering method)
library(cluster)
#Read the data; column 1 holds the country names
food = read.csv("protein.csv")
#Pass the numeric columns and the number of clusters (default metric is Euclidean)
pam.result <- pam(food[,-1], 2)
pam.result$clustering
summary(pam.result)
#Repeat with the Manhattan distance measure
#diss=FALSE indicates we pass raw data, not a precomputed dissimilarity matrix
pam.result <- pam(food[,-1], k=2, diss=FALSE, metric="manhattan")
pam.result$clustering
summary(pam.result)
#Plot countries in the RedMeat/WhiteMeat plane, coloured by cluster
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat", ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country, col=pam.result$clustering+1)
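As a follow-up on the same pam.result object, the cluster package can also draw a silhouette plot, a quick check on how well-separated the clusters are:

#Silhouette plot for the PAM result; average width close to 1 means tight, well-separated clusters
plot(pam.result, which.plots = 2)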

Happy Learning!!!
