"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 29, 2016

Naive Bayes Classifier

Naive Bayes Classifier Notes and Examples

  • Works on the assumption that the occurrence of word i is independent of the occurrence of word i+1
  • In reality, a sentence carries context only when words occur alongside appropriate terms and in appropriate positions; Naive Bayes ignores this
  • For illustration, two classes and a test document to classify are sketched below
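Below is a minimal hand-rolled sketch of this idea in R: two tiny classes of training documents and one test document, scored with multinomial Naive Bayes and Laplace smoothing. The class names and documents are made-up illustrations, not real data.
# Multinomial Naive Bayes sketch (illustrative toy data)
train <- list(
  sports   = c("game team win score", "team play win"),
  politics = c("vote election win", "party vote policy")
)
test_doc <- "team win vote"
tokenize <- function(s) unlist(strsplit(s, " "))
vocab <- unique(tokenize(paste(unlist(train), collapse = " ")))
# score(class) = log P(class) + sum of log P(word | class), with Laplace (+1) smoothing
score <- sapply(names(train), function(cl) {
  words  <- tokenize(paste(train[[cl]], collapse = " "))
  counts <- table(factor(words, levels = vocab))
  log_prior <- log(length(train[[cl]]) / length(unlist(train)))
  log_lik   <- sum(sapply(tokenize(test_doc), function(w)
    log((counts[w] + 1) / (length(words) + length(vocab)))))
  log_prior + log_lik
})
score
names(which.max(score))   # predicted class for the test document
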
Ref - Link

Happy Learning!!!

February 23, 2016

Hierarchical Clustering


  • Compute the distance between every pair of clusters
  • Merge the nearest pair; repeat until the number of clusters equals the number of clusters needed
  • The entire process can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs)
library(cluster)
#protein.csv data set used throughout these notes
food = read.csv("protein.csv")
#Drop the country name column
#use euclidean distance
#complete linkage mechanism
#diss - FALSE as you are passing a data frame, not a dissimilarity matrix
foodagg=agnes(food[,-1],diss=FALSE,metric="euclidean", method="complete")
plot(foodagg)
#To get the required number of clusters, cut the tree (convert to hclust first)
cutree(as.hclust(foodagg),k=5)
# Agglomerative hierarchical clustering (agnes)
library(cluster)
ah <- agnes(food[,-1])
plot(ah)
#Divisive hierarchical clustering (diana)
library(cluster)
dh <- diana(food[,-1])
plot(dh)
#Examples http://www.math.wustl.edu/~victor/classes/ma322/r-eg-28.txt
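For comparison, here is a minimal sketch of the same idea with base R's hclust(), showing the three linkage choices listed above (single, complete, average) on the same protein data; this is an illustrative variant, not part of the original session code.
# Base-R hierarchical clustering with the three linkage methods
food = read.csv("protein.csv")
d <- dist(food[,-1])                      # Euclidean distance matrix (country column dropped)
hc_single   <- hclust(d, method="single")
hc_complete <- hclust(d, method="complete")
hc_average  <- hclust(d, method="average")
plot(hc_complete, labels=food$Country)    # dendrogram with country labels
cutree(hc_complete, k=5)                  # cut the dendrogram into 5 clusters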

Happy Learning!!!

K-medoids, K-means

Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-Medoids
  • Restricts centres to data points
  • The centre is picked only from the data points
  • Uses the same sum-of-distances cost function, but the distance need not be Euclidean
  • Use your own custom distance function when both numerical and categorical variables are involved (see the mixed-type distance sketch after the lists below)
  • Example: a 25-level language variable becomes 24 dummy columns and an M/F/N variable becomes 2 columns - one fewer than the number of levels, because the all-zero combination also encodes one level; compute your own custom distance on these
Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
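As a sketch of the "custom distance for mixed variables" point above, the cluster package's daisy() can build a Gower dissimilarity for a data frame that mixes numeric and categorical columns, and pam() can then cluster that dissimilarity directly. The toy data frame below is an illustrative assumption.
# Mixed-type distance sketch: Gower dissimilarity + PAM
library(cluster)
df <- data.frame(income = c(10, 12, 50, 55),
                 gender = factor(c("M", "F", "F", "N")))
d <- daisy(df, metric = "gower")          # handles numeric and factor columns together
pam.mixed <- pam(d, k = 2, diss = TRUE)   # diss = TRUE: the input is a dissimilarity
pam.mixed$clustering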
#K medoids (Non-Hierarchical Clustering)
library(cluster)
#data frame
food = read.csv("protein.csv")
#Pass DF, Number of Clusters
pam.result <- pam(food[,-1],2)
pam.result$clustering
summary(pam.result)
#use manhattan measure
#Pass DF, Number of Clusters
#Argument diss=FALSE because we pass a data frame, not a dissimilarity matrix
#Partitioning Around Medoids
pam.result <- pam(food[,-1],k=2,diss=FALSE,metric="manhattan")
pam.result$clustering
summary(pam.result)
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat",ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country,col=pam.result$clustering+1)

Happy Learning!!!

February 22, 2016

R and SQL Server

This post is an example of querying SQL Server from R and visualizing the results. The package used is RODBC, together with lattice and ggplot2 for plotting. A sample walkthrough code snippet is provided.

library(RODBC)
library(ggplot2)
conn <- odbcDriverConnect("Driver=SQL Server; Server=10.10.10.10,1500; Database=TestDB; Uid=useyouruser; Pwd=useyourpwd;")
resultsdata <- sqlQuery(conn, "SELECT distinct(name), COUNT(1) as 'ZCount' FROM [TestDB].[dbo].[TableA] S JOIN [TestDB].[dbo].[TableB] Z ON S.id = Z.id group by name having count(1) > 20")
odbcClose(conn)
dim(resultsdata)
library(lattice)
dotplot(resultsdata$name~resultsdata$ZCount)
slist <- resultsdata$name
zcount <- resultsdata$ZCount
ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point()
#highlight points
ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point(size=3, colour="#CC0000")
#background color change
p<-ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point(size=3)
p + theme(panel.background = element_rect(fill = 'green', colour = 'red'))
ggsave("d:\\splot.png")
#bar chart
p = ggplot(data.frame(slist,zcount), aes(slist,zcount,fill=zcount)) + geom_bar(stat="identity")
p + scale_x_discrete(labels = abbreviate)
ggsave("d:\\splot.png")
Happy Learning!!!

February 19, 2016

R Kaggle Exercise - Baby Names

#Download Data from https://www.kaggle.com/kaggle/us-baby-names
#Playing with source script from https://www.kaggle.com/jagelves/d/kaggle/us-baby-names/2014-popular-baby-names-by-state/files
#setwd - set current working directory
#load data
Names=read.csv("StateNames.csv")
#Pick Baby Boy Names
Names2014M=Names[Names$Year==2014 & Names$Gender=="M",]
#List and see few rows in output
head(Names2014M)
#Aggregate by state
df.agg = aggregate(Count~State,Names2014M,max)
Names2014Max=merge(df.agg,Names2014M)
#Sort by A..Z
Names2014Max$State = factor(Names2014Max$State, levels = Names2014Max$State[order(Names2014Max$State)])
library(ggplot2)
#plot data
#Names2014Max - Data Frame
#aes - Generate aesthetic mappings of variables
#geom_tile - Tile plot as densely as possible, assuming that every tile is the same size.
#fill: internal colour
#Changing colors is easy. Simply provide different string or hex values in the scale_fill_gradient function
ggplot(Names2014Max, aes(State, Name)) +
geom_tile(aes(fill = Count), colour = "black") +
scale_fill_gradient(low = "white", high = "blue")
Happy Learning!!!

R plot examples, matrix, aggregates, conditions examples

#Create a vector from 1 to 20, Step by 2
a <- c(seq(1,20,2))
#Create a 5 X 2 Matrix
m <- matrix(a, nrow = 5, ncol = 2, byrow=TRUE)
m
#Create a 2 X 5 Matrix
m <- matrix(a, nrow = 2, ncol = 5, byrow=TRUE)
m
#transpose
t(m)
#Create a 2 X 2 square matrix from the first four elements
m <- matrix(a[1:4], nrow = 2, ncol = 2, byrow=TRUE)
det(m)
#matrix multiplication, Operator %*%
m%*%m
#eigen values
eigen(m)
#svd - singular value decomposition
svd(m)
#vector with hundred elements
m <- c(seq(1,100,1))
#List all values > 10
m[m>10]
#List all values > 10 and < 50
m[m>10 & m < 50]
#List all values > 10 and !=50
m[m>10 & m!= 50]
#Conditional select from data frame
food = read.csv("protein.csv")
newdata <- food[(food$RedMeat>5),]
newdata
#Load Some Sample Data
dat <- read.table(textConnection('Group Score Info
1 1 1 a
2 1 2 b
3 1 3 c
4 2 4 d
5 2 3 e
6 2 1 f'))
#print summary
summary(dat)
#Aggregations
aggregate(Score~Group,dat,sum)
aggregate(Score~Group,dat,mean)
#functions
cellbillcompute<-function()
{
billdays<- c(55,10,15,33,21,33,45,66,35,25)
#max value
print(max(billdays))
#min value
print(min(billdays))
#sum of all bill days
print(sum(billdays))
#Number of days bill value > 20
print(length(billdays[billdays>20]))
}
cellbillcompute()
#Installing Packages
remove.packages(c("ggplot2", "data.table"))
install.packages('Rcpp', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
#List available example datasets
data()
#Visualization Examples
carsmodel <- c("Dzire", "Vitara", "ALTO", "Gypsy", "Baleno")
sales <- c(200, 555, 424, 599, 12000)
#plot example
plot(factor(carsmodel),sales,type="o",col="green",pch=22)
#dotplot
library(lattice)
dotplot(sales~carsmodel)
#Connected Lines
dotplot(sales~carsmodel,type="b")
library(ggplot2)
qplot(carsmodel,sales)
ggplot(data.frame(carsmodel,sales), aes(carsmodel,sales)) + geom_point()
#plot for share values
library("quantmod")
getSymbols('TYC')
chartSeries(TYC, subset='last 3 months')
addBBands()
food = read.csv("protein.csv")
#Find Outliers
boxplot(food$RedMeat)
#boxplot with multiple variables
boxplot(food$RedMeat,food$WhiteMeat, food$Eggs)
#Histogram of data
hist(food$RedMeat)
#Summary
summary(food)
#heatmap
library(rgl)
dist = as.matrix(food[-1])
heatmap(dist)
Happy Learning!!!

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a good session on K-means clustering. Code snippets and notes are in this post.

Clustering - assignment of observations into subsets so that observations within the same subset are similar in some sense

K-Means Clustering
  • Highly used algorithm
  • You need to decide the number of clusters (K) up front
How it works ?
  • Start with a random guess of the cluster centres
  • Go through every point, compute its distance to each cluster centre (C1, C2, ...), assign it to the nearest centre, then recompute the centres; repeat until the assignments stop changing (a sketch follows below)
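A minimal base-R sketch of one such iteration on illustrative random data (repeating the two steps until the assignments stop changing gives the full algorithm):
# One k-means iteration by hand
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)            # 10 points in 2-D
centres <- X[sample(nrow(X), 2), ]          # random guess of 2 cluster centres
# assignment step: index of the nearest centre for each point
assignment <- apply(X, 1, function(p) which.min(colSums((t(centres) - p)^2)))
# update step: recompute each centre as the mean of its assigned points
centres <- t(sapply(1:2, function(k) colMeans(X[assignment == k, , drop = FALSE])))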
How to measure good clustering ?
  • Intra cluster distances minimized
  • Inter cluster distances maximized
Cost function 
  • The sum of squared distances from each point to its cluster centre should be minimised (see the sketch below)
  • From iteration to iteration the cost function keeps decreasing
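As a sketch, this cost can be recomputed by hand from a fitted kmeans object (assuming a fit such as grpMeat below, built on the WhiteMeat/RedMeat columns of the protein data):
# Total within-cluster sum of squares computed by hand
X <- as.matrix(food[, c("WhiteMeat", "RedMeat")])
wss <- sum((X - grpMeat$centers[grpMeat$cluster, ])^2)
wss   # should equal sum(grpMeat$withinss), i.e. grpMeat$tot.withinss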
Learning Points
  • While measuring cluster quality, a global centre (the centroid of all points) is identified in addition to the local, per-cluster centres
  • The total sum of squares is computed against the global centre; each cluster's within sum of squares against its own centre
Mathematical Learnings 
What is sum-of-squared distances method ?
What is Euclidean Distance ?
  • The distance between two points in the plane with coordinates (x, y) and (a, b) is given by sqrt((x - a)^2 + (y - b)^2)
  • Link - ref
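A quick check of the formula in R, with (1, 2) and (4, 6) as illustrative points:
# Euclidean distance: sqrt((1-4)^2 + (2-6)^2) = 5
sqrt(sum((c(1, 2) - c(4, 6))^2))
dist(rbind(c(1, 2), c(4, 6)))   # same result via dist()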
What is local optima ?
  • Local optima are the relatively best solutions within a neighbouring set of solutions
How to choose value of K ?

Elbow method - plot the cost function against the number of clusters; the point beyond which the decrease levels off (the "elbow") denotes the optimum number of clusters.
#Source of Data https://github.com/siva2k16/STA380/tree/master/data/protein.csv
food = read.csv("protein.csv")
#Return first few records
head(food)
set.seed(1)
#Clusters - 3
#nstart - number of random starts (here 10 runs); the best run is kept
#centers=3 is user input
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="MacQueen")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
# Plot results
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 3 clusters", pch=20, cex=2)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="Lloyd")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="Forgy")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
#Plot for 4 Variables
grpMeat <- kmeans(food[,c("WhiteMeat","RedMeat","Eggs","Milk")], centers=3,nstart=10,algorithm ="Forgy")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
#Python example - https://medium.com/nerd-for-tech/k-means-clustering-using-python-2150769bd0b9


Elbow method - plot for the elbow method. At centers = 3 there is a steep fall, after which the curve flattens, which suggests 3 is the optimum number of clusters.
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=2,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 2 clusters", pch=20, cex=2)
a = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 3 clusters", pch=20, cex=2)
b = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=4,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 4 clusters", pch=20, cex=2)
c = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=5,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 5 clusters", pch=20, cex=2)
d = sum(grpMeat$withinss)
x=c(2,3,4,5)
y=c(a,b,c,d)
plot(x,y,type="o")
#determine optimal number of clusters using the NbClust package
library(NbClust)
food = read.csv("protein.csv")
#For consistent results set the seed value
set.seed(1)
numberofclusters <- NbClust(food[,c("WhiteMeat","RedMeat")],min.nc=2,max.nc=15,method="kmeans")
# 3 is the number of clusters suggested by this test
table(numberofclusters$Best.n[1,])

  • Hard Clustering - each object belongs to exactly one cluster
  • Soft Clustering - an object can belong to several clusters, each with a probability (degree of membership) of how well it fits that cluster (see the fanny() sketch below)
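A minimal sketch of soft clustering with the cluster package's fanny() (fuzzy analysis clustering) on the same protein columns used above; the membership matrix gives each observation's degree of membership in every cluster:
# Soft (fuzzy) clustering sketch
library(cluster)
fuzzy <- fanny(food[, c("WhiteMeat", "RedMeat")], k = 3)
head(fuzzy$membership)   # each row sums to 1: degree of membership per cluster
fuzzy$clustering         # hard assignment = cluster with the highest membership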
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure (accounts for correlation between variables)
  • Correlation-based distance: (1 - correlation coefficient), as sketched below
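A sketch of those two ideas on the protein data (both snippets are illustrative; note that mahalanobis() returns squared distances):
# Correlation-based distance between observations: 1 - correlation coefficient
X <- as.matrix(food[, -1])
cor_dist <- as.dist(1 - cor(t(X)))     # correlations computed between rows (observations)
# (squared) Mahalanobis distance of each observation from the overall mean
maha <- mahalanobis(X, center = colMeans(X), cov = cov(X))
head(maha)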
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-Medoids
  • Restricts centres to data points
  • The centre is picked only from the data points
  • Uses the same sum-of-distances cost function, but the distance need not be Euclidean
#K medoids (Non-Hierarchical Clustering)
#https://github.com/jgscott/STA380/blob/master/data/protein.csv
library(cluster)
#data frame
food = read.csv("protein.csv")
#Pass DF, Number of Clusters
pam.result <- pam(food[,-1],2)
pam.result$clustering
summary(pam.result)
#use manhattan measure
#Pass DF, Number of Clusters
#Argument diss=FALSE because we pass a data frame, not a dissimilarity matrix
pam.result <- pam(food[,-1],k=2,diss=FALSE,metric="manhattan")
pam.result$clustering
summary(pam.result)
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat",ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country,col=pam.result$clustering+1)
 Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs)
Hierarchical Clustering
  • Compute the distance between every pair of clusters
  • Merge the nearest pair; repeat until the number of clusters equals the number of clusters needed
  • The entire process can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
  • #K-means custom distance function
    #K medoids - custom distance
    #fpc package (K medoids)
    library(cluster)
    #Drop the country name column
    #use euclidean distance
    #complete linkage mechanism
    #diss - FALSE as you are passing a data frame
    foodagg=agnes(food[,-1],diss=FALSE,metric="euclidean", method="complete")
    plot(foodagg)
    #To get the required number of clusters, cut the tree (convert to hclust first)
    cutree(as.hclust(foodagg),k=5)



Happy Learning!!!

February 16, 2016

DBMS Session Three

Relational Model Classification

Key - A subset of attributes
Super Key - Sufficient to identify a tuple uniquely
Considerations for primary key
  • Not-null values
  • Few attributes
  • Key often used in data access clauses
Relational Operators
  • Selection (With Filters Applied)
  • Projection (Select with explicitly specified columns)
  • Cartesian product
  • Union
  • Difference
  • Intersection
  • Join
SQL Refreshers
  • SELECT, WHERE, Aggregate (Group by, Having), JOINs, String operations
  • SET Operations, Handling NULL values, Subqueries
  • DELETE, NOT EXISTS, Conditional Updates
  • Views, Materialized views
  • Authentication & Authorization (Roles & Permissions)
Happy Learning!!!




February 11, 2016

Data Models

Captured Notes from Session #2 - Data Models

Hierarchical Data Models
  • Tree like structures
  • Used in Windows Registry
  • Frequently used (e.g., IBM's IMS)
  • DL/1 - the programming language for IMS
  • Difficult to reorganize
Graph / Network Model
  • Organize collection of records in form of directed graph
  • Three-way relationships can't be maintained
ER Model
  • Defined in terms of Entity, Relationships
  • Never caught on as a physical model
Object Oriented Database model
  • Difficult to map programming objects to database objects
Relational Model
  • Better physical data independence
  • Better logical independence
  • Won because of its relational algebra foundation
Happy Learning!!!

February 01, 2016

World of Data Science

My second semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read / re-read the same kind of definitions / introductory articles on data science before, but Prof. Manish Singh's session gave a whole new analogy and interesting examples to correlate with.

For big data I have always referred back to the 4 Vs: Volume, Veracity, Velocity and Variety. In the same spirit, the definition was presented as
  • Internet of Content - Youtube, Ebooks, Wikipedia, News Feeds
  • Internet of People - Email, Facebook, Linkedin etc.
  • Internet of Things - devices with unique IDs communicating with / managing infrastructure
  • Internet of Location - spatial data related analysis
This "Internet of *" view is a good representation of the different forms / flows of information behind the four Vs.

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – an interdisciplinary field for extracting knowledge from data.

The Data Science workflow involves Data Visualization, Data Analysis, Data Processing and Data Storage tasks. Some of the tools used in each layer are listed below.


Tools available

Data Visualization
Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (Graph partitioning tool)

Data Analysis
Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR

Data Processing

Scheduler – Azkaban, Oozie, Ivory
Cluster Monitoring – (Ganglia + Nagios), Chukwa, Zookeeper

Data Storage
HDFS, HSFTP (HDFS over HTTP), S3, KFS (Kosmos File System)
Data Movement – SQOOP, Flume, Scribe, Kafka, MessageQueue
Columnar Storage – Zebra
Key Value - Hbase

The key ingredients of Data Science are
  • Data Management System
  • Data Mining - the computational process of identifying patterns in large data sets, using techniques at the intersection of multiple disciplines (AI, Stats, Machine Learning, Computer Networks): data classification, clustering, regression, association rule finding and anomaly detection
  • Process Mining - aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)
  • Information Visualization - visualization techniques for large data sets, interactive information visualization, and how to really visualize big data


Databases Vs Data Science
                 Databases                                     Data Science
Data Value       Precious                                      Cheap
Data Volume      Modest                                        Massive
Structured       Strongly (Schema)                             Weakly or none (text)
Priorities       Consistency, Error Recovery, Auditability     Speed, Availability, Query richness
Base             Relational Algebra                            Linear Algebra

PS: My professor had provided references to the examples; I am sharing this post based on notes / slides from my session.  

Happy Learning!!!