"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 29, 2016

Naive Bayes Classifier

Naive Bayes Classifier Notes and Examples

  • Works on the assumption that the occurrence of word i is independent of the occurrence of word i+1
  • In reality, a sentence carries context only when words occur alongside appropriate terms and in appropriate positions; Naive Bayes ignores this
  • For illustration, two classes and a test document to classify are sketched below
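Below is a minimal hand-rolled sketch of this idea in R: two tiny classes of training documents and one test document, scored with multinomial Naive Bayes and Laplace smoothing. The class names and documents are made-up illustrations, not real data.
# Multinomial Naive Bayes sketch (illustrative toy data)
train <- list(
  sports   = c("game team win score", "team play win"),
  politics = c("vote election win", "party vote policy")
)
test_doc <- "team win vote"
tokenize <- function(s) unlist(strsplit(s, " "))
vocab <- unique(tokenize(paste(unlist(train), collapse = " ")))
# score(class) = log P(class) + sum of log P(word | class), with Laplace (+1) smoothing
score <- sapply(names(train), function(cl) {
  words  <- tokenize(paste(train[[cl]], collapse = " "))
  counts <- table(factor(words, levels = vocab))
  log_prior <- log(length(train[[cl]]) / length(unlist(train)))
  log_lik   <- sum(sapply(tokenize(test_doc), function(w)
    log((counts[w] + 1) / (length(words) + length(vocab)))))
  log_prior + log_lik
})
score
names(which.max(score))   # predicted class for the test document
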
Ref - Link

Happy Learning!!!

February 23, 2016

Hierarchical Clustering


  • Compute the distance between every pair of clusters
  • Merge the nearest pair; repeat until the number of clusters equals the number of clusters needed
  • The entire process can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs)
library(cluster)
#protein.csv data set used throughout these notes
food = read.csv("protein.csv")
#Drop the country name column
#use euclidean distance
#complete linkage mechanism
#diss - FALSE as you are passing a data frame, not a dissimilarity matrix
foodagg=agnes(food[,-1],diss=FALSE,metric="euclidean", method="complete")
plot(foodagg)
#To get the required number of clusters, cut the tree (convert to hclust first)
cutree(as.hclust(foodagg),k=5)
# Agglomerative hierarchical clustering (agnes)
library(cluster)
ah <- agnes(food[,-1])
plot(ah)
#Divisive hierarchical clustering (diana)
library(cluster)
dh <- diana(food[,-1])
plot(dh)
#Examples http://www.math.wustl.edu/~victor/classes/ma322/r-eg-28.txt
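For comparison, here is a minimal sketch of the same idea with base R's hclust(), showing the three linkage choices listed above (single, complete, average) on the same protein data; this is an illustrative variant, not part of the original session code.
# Base-R hierarchical clustering with the three linkage methods
food = read.csv("protein.csv")
d <- dist(food[,-1])                      # Euclidean distance matrix (country column dropped)
hc_single   <- hclust(d, method="single")
hc_complete <- hclust(d, method="complete")
hc_average  <- hclust(d, method="average")
plot(hc_complete, labels=food$Country)    # dendrogram with country labels
cutree(hc_complete, k=5)                  # cut the dendrogram into 5 clusters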

Happy Learning!!!

K-medoids, K-means

Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-Medoids
  • Restricts centres to data points
  • The centre is picked only from the data points
  • Uses the same sum-of-distances cost function, but the distance need not be Euclidean
  • Use your own custom distance function when both numerical and categorical variables are involved (see the mixed-type distance sketch after the lists below)
  • Example: a 25-level language variable becomes 24 dummy columns and an M/F/N variable becomes 2 columns - one fewer than the number of levels, because the all-zero combination also encodes one level; compute your own custom distance on these
Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
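As a sketch of the "custom distance for mixed variables" point above, the cluster package's daisy() can build a Gower dissimilarity for a data frame that mixes numeric and categorical columns, and pam() can then cluster that dissimilarity directly. The toy data frame below is an illustrative assumption.
# Mixed-type distance sketch: Gower dissimilarity + PAM
library(cluster)
df <- data.frame(income = c(10, 12, 50, 55),
                 gender = factor(c("M", "F", "F", "N")))
d <- daisy(df, metric = "gower")          # handles numeric and factor columns together
pam.mixed <- pam(d, k = 2, diss = TRUE)   # diss = TRUE: the input is a dissimilarity
pam.mixed$clustering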
#K medoids (Non-Hierarchical Clustering)
library(cluster)
#data frame
food = read.csv("protein.csv")
#Pass DF, Number of Clusters
pam.result <- pam(food[,-1],2)
pam.result$clustering
summary(pam.result)
#use manhattan measure
#Pass DF, Number of Clusters
#Argument diss=FALSE because we pass a data frame, not a dissimilarity matrix
#Partitioning Around Medoids
pam.result <- pam(food[,-1],k=2,diss=FALSE,metric="manhattan")
pam.result$clustering
summary(pam.result)
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat",ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country,col=pam.result$clustering+1)

Happy Learning!!!

February 22, 2016

R and SQL Server

This post is an example of querying SQL Server from R and visualizing the results. The package used is RODBC, together with lattice and ggplot2 for plotting. A sample walkthrough code snippet is provided.

library(RODBC)
library(ggplot2)
conn <- odbcDriverConnect("Driver=SQL Server; Server=10.10.10.10,1500; Database=TestDB; Uid=useyouruser; Pwd=useyourpwd;")
resultsdata <- sqlQuery(conn, "SELECT distinct(name), COUNT(1) as 'ZCount' FROM [TestDB].[dbo].[TableA] S JOIN [TestDB].[dbo].[TableB] Z ON S.id = Z.id group by name having count(1) > 20")
odbcClose(conn)
dim(resultsdata)
library(lattice)
dotplot(resultsdata$name~resultsdata$ZCount)
slist <- resultsdata$name
zcount <- resultsdata$ZCount
ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point()
#highlight points
ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point(size=3, colour="#CC0000")
#background color change
p<-ggplot(data.frame(zcount,slist), aes(zcount,slist)) + geom_point(size=3)
p + theme(panel.background = element_rect(fill = 'green', colour = 'red'))
ggsave("d:\\splot.png")
#bar chart
p = ggplot(data.frame(slist,zcount), aes(slist,zcount,fill=zcount)) + geom_bar(stat="identity")
p + scale_x_discrete(labels = abbreviate)
ggsave("d:\\splot.png")
Happy Learning!!!

February 19, 2016

R Kaggle Exercise - Baby Names

#Download Data from https://www.kaggle.com/kaggle/us-baby-names
#Playing with source script from https://www.kaggle.com/jagelves/d/kaggle/us-baby-names/2014-popular-baby-names-by-state/files
#setwd - set current working directory
#load data
Names=read.csv("StateNames.csv")
#Pick Baby Boy Names
Names2014M=Names[Names$Year==2014 & Names$Gender=="M",]
#List and see few rows in output
head(Names2014M)
#Aggregate by state
df.agg = aggregate(Count~State,Names2014M,max)
Names2014Max=merge(df.agg,Names2014M)
#Sort by A..Z
Names2014Max$State = factor(Names2014Max$State, levels = Names2014Max$State[order(Names2014Max$State)])
library(ggplot2)
#plot data
#Names2014Max - Data Frame
#aes - Generate aesthetic mappings of variables
#geom_tile - Tile plot as densely as possible, assuming that every tile is the same size.
#fill: internal colour
#Changing colors is easy. Simply provide different string or hex values in the scale_fill_gradient function
ggplot(Names2014Max, aes(State, Name)) +
geom_tile(aes(fill = Count), colour = "black") +
scale_fill_gradient(low = "white", high = "blue")
Happy Learning!!!

R plot examples, matrix, aggregates, conditions examples

#Create a vector from 1 to 20, Step by 2
a <- c(seq(1,20,2))
#Create a 5 X 2 Matrix
m <- matrix(a, nrow = 5, ncol = 2, byrow=TRUE)
m
#Create a 2 X 5 Matrix
m <- matrix(a, nrow = 2, ncol = 5, byrow=TRUE)
m
#transpose
t(m)
#Create a 2 X 2 square matrix from the first four elements
m <- matrix(a[1:4], nrow = 2, ncol = 2, byrow=TRUE)
det(m)
#matrix multiplication, Operator %*%
m%*%m
#eigen values
eigen(m)
#svd - singular value decomposition
svd(m)
#vector with hundred elements
m <- c(seq(1,100,1))
#List all values > 10
m[m>10]
#List all values > 10 and < 50
m[m>10 & m < 50]
#List all values > 10 and !=50
m[m>10 & m!= 50]
#Conditional select from data frame
food = read.csv("protein.csv")
newdata <- food[(food$RedMeat>5),]
newdata
#Load Some Sample Data
dat <- read.table(textConnection('Group Score Info
1 1 1 a
2 1 2 b
3 1 3 c
4 2 4 d
5 2 3 e
6 2 1 f'))
#print summary
summary(dat)
#Aggregations
aggregate(Score~Group,dat,sum)
aggregate(Score~Group,dat,mean)
#functions
cellbillcompute<-function()
{
billdays<- c(55,10,15,33,21,33,45,66,35,25)
#max value
print(max(billdays))
#min value
print(min(billdays))
#sum of all bill days
print(sum(billdays))
#Number of days bill value > 20
print(length(billdays[billdays>20]))
}
cellbillcompute()
#Installing Packages
remove.packages(c("ggplot2", "data.table"))
install.packages('Rcpp', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
#List available example datasets
data()
#Visualization Examples
carsmodel <- c("Dzire", "Vitara", "ALTO", "Gypsy", "Baleno")
sales <- c(200, 555, 424, 599, 12000)
#plot example
plot(factor(carsmodel),sales,type="o",col="green",pch=22)
#dotplot
library(lattice)
dotplot(sales~carsmodel)
#Connected Lines
dotplot(sales~carsmodel,type="b")
library(ggplot2)
qplot(carsmodel,sales)
ggplot(data.frame(carsmodel,sales), aes(carsmodel,sales)) + geom_point()
#plot for share values
library("quantmod")
getSymbols('TYC')
chartSeries(TYC, subset='last 3 months')
addBBands()
food = read.csv("protein.csv")
#Find Outliers
boxplot(food$RedMeat)
#boxplot with multiple variables
boxplot(food$RedMeat,food$WhiteMeat, food$Eggs)
#Histogram of data
hist(food$RedMeat)
#Summary
summary(food)
#heatmap
library(rgl)
dist = as.matrix(food[-1])
heatmap(dist)
Happy Learning!!!

February 18, 2016

Cluster Analytics - Deep Dive on K Means

Had a good session on K-means clustering. Code snippets and notes are in this post.

Clustering - assignment of observations into subsets so that observations within the same subset are similar in some sense

K-Means Clustering
  • Highly used algorithm
  • You need to decide the number of clusters (K) up front
How it works ?
  • Start with a random guess of the cluster centres
  • Go through every point, compute its distance to each cluster centre (C1, C2, ...), assign it to the nearest centre, then recompute the centres; repeat until the assignments stop changing (a sketch follows below)
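A minimal base-R sketch of one such iteration on illustrative random data (repeating the two steps until the assignments stop changing gives the full algorithm):
# One k-means iteration by hand
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)            # 10 points in 2-D
centres <- X[sample(nrow(X), 2), ]          # random guess of 2 cluster centres
# assignment step: index of the nearest centre for each point
assignment <- apply(X, 1, function(p) which.min(colSums((t(centres) - p)^2)))
# update step: recompute each centre as the mean of its assigned points
centres <- t(sapply(1:2, function(k) colMeans(X[assignment == k, , drop = FALSE])))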
How to measure good clustering ?
  • Intra cluster distances minimized
  • Inter cluster distances maximized
Cost function 
  • The sum of squared distances from each point to its cluster centre should be minimised (see the sketch below)
  • From iteration to iteration the cost function keeps decreasing
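As a sketch, this cost can be recomputed by hand from a fitted kmeans object (assuming a fit such as grpMeat below, built on the WhiteMeat/RedMeat columns of the protein data):
# Total within-cluster sum of squares computed by hand
X <- as.matrix(food[, c("WhiteMeat", "RedMeat")])
wss <- sum((X - grpMeat$centers[grpMeat$cluster, ])^2)
wss   # should equal sum(grpMeat$withinss), i.e. grpMeat$tot.withinss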
Learning Points
  • While measuring cluster quality, a global centre (the centroid of all points) is identified in addition to the local, per-cluster centres
  • The total sum of squares is computed against the global centre; each cluster's within sum of squares against its own centre
Mathematical Learnings 
What is sum-of-squared distances method ?
What is Euclidean Distance ?
  • The distance between two points in the plane with coordinates (x, y) and (a, b) is given by sqrt((x - a)^2 + (y - b)^2)
  • Link - ref
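A quick check of the formula in R, with (1, 2) and (4, 6) as illustrative points:
# Euclidean distance: sqrt((1-4)^2 + (2-6)^2) = 5
sqrt(sum((c(1, 2) - c(4, 6))^2))
dist(rbind(c(1, 2), c(4, 6)))   # same result via dist()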
What is local optima ?
  • Local optima are the relatively best solutions within a neighbouring set of solutions
How to choose value of K ?

Elbow method - plot the cost function against the number of clusters; the point beyond which the decrease levels off (the "elbow") denotes the optimum number of clusters.
#Source of Data https://github.com/siva2k16/STA380/tree/master/data/protein.csv
food = read.csv("protein.csv")
#Return first few records
head(food)
set.seed(1)
#Clusters - 3
#nstart - number of random starts (here 10 runs); the best run is kept
#centers=3 is user input
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="MacQueen")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
# Plot results
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 3 clusters", pch=20, cex=2)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="Lloyd")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=10,algorithm ="Forgy")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
#Plot for 4 Variables
grpMeat <- kmeans(food[,c("WhiteMeat","RedMeat","Eggs","Milk")], centers=3,nstart=10,algorithm ="Forgy")
grpMeat$cluster
grpMeat$centers
grpMeat$withinss
grpMeat$size
#Python example - https://medium.com/nerd-for-tech/k-means-clustering-using-python-2150769bd0b9


Elbow method - plot for the elbow method. At centers = 3 there is a steep fall, after which the curve flattens, which suggests 3 is the optimum number of clusters.
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=2,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 2 clusters", pch=20, cex=2)
a = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=3,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 3 clusters", pch=20, cex=2)
b = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=4,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 4 clusters", pch=20, cex=2)
c = sum(grpMeat$withinss)
grpMeat <- kmeans( food[,c("WhiteMeat","RedMeat")], centers=5,nstart=1)
plot(food[,c("WhiteMeat","RedMeat")], col =(grpMeat$cluster +1), main="K-Means result with 5 clusters", pch=20, cex=2)
d = sum(grpMeat$withinss)
x=c(2,3,4,5)
y=c(a,b,c,d)
plot(x,y,type="o")
#determine optimal number of clusters using the NbClust package
library(NbClust)
food = read.csv("protein.csv")
#For consistent results set the seed value
set.seed(1)
numberofclusters <- NbClust(food[,c("WhiteMeat","RedMeat")],min.nc=2,max.nc=15,method="kmeans")
# 3 is the number of clusters suggested by this test
table(numberofclusters$Best.n[1,])

  • Hard Clustering - each object belongs to exactly one cluster
  • Soft Clustering - an object can belong to several clusters, each with a probability (degree of membership) of how well it fits that cluster (see the fanny() sketch below)
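A minimal sketch of soft clustering with the cluster package's fanny() (fuzzy analysis clustering) on the same protein columns used above; the membership matrix gives each observation's degree of membership in every cluster:
# Soft (fuzzy) clustering sketch
library(cluster)
fuzzy <- fanny(food[, c("WhiteMeat", "RedMeat")], k = 3)
head(fuzzy$membership)   # each row sums to 1: degree of membership per cluster
fuzzy$clustering         # hard assignment = cluster with the highest membership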
Other Techniques
  • Remove correlation before computing distances
  • Mahalanobis distance measure (accounts for correlation between variables)
  • Correlation-based distance: (1 - correlation coefficient), as sketched below
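A sketch of those two ideas on the protein data (both snippets are illustrative; note that mahalanobis() returns squared distances):
# Correlation-based distance between observations: 1 - correlation coefficient
X <- as.matrix(food[, -1])
cor_dist <- as.dist(1 - cor(t(X)))     # correlations computed between rows (observations)
# (squared) Mahalanobis distance of each observation from the overall mean
maha <- mahalanobis(X, center = colMeans(X), cov = cov(X))
head(maha)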
More Reads

K Means Clustering in R Example
K Means Clustering by Hand / Excel
K means Clustering in R example Iris Data
Linear Regression Example in R using lm() Function
Linear Regression by Hand and in Excel

K-medoids and K-means
Great learning; a lot of revision is needed to really dive deep and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
K-Medoids
  • Restricts centres to data points
  • The centre is picked only from the data points
  • Uses the same sum-of-distances cost function, but the distance need not be Euclidean
#K medoids (Non-Hierarchical Clustering)
#https://github.com/jgscott/STA380/blob/master/data/protein.csv
library(cluster)
#data frame
food = read.csv("protein.csv")
#Pass DF, Number of Clusters
pam.result <- pam(food[,-1],2)
pam.result$clustering
summary(pam.result)
#use manhattan measure
#Pass DF, Number of Clusters
#Argument diss=FALSE because we pass a data frame, not a dissimilarity matrix
pam.result <- pam(food[,-1],k=2,diss=FALSE,metric="manhattan")
pam.result$clustering
summary(pam.result)
plot(food$RedMeat, food$WhiteMeat, type="n", xlim=c(3,19), xlab="Red Meat",ylab="White Meat")
text(x=food$RedMeat, y=food$WhiteMeat, labels=food$Country,col=pam.result$clustering+1)
 Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs)
Hierarchical Clustering
  • Compute the distance between every pair of clusters
  • Merge the nearest pair; repeat until the number of clusters equals the number of clusters needed
  • The entire process can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
  • #K-means custom distance function
    #K medoids - custom distance
    #fpc package (K medoids)
    library(cluster)
    #Drop the country name column
    #use euclidean distance
    #complete linkage mechanism
    #diss - FALSE as you are passing a data frame
    foodagg=agnes(food[,-1],diss=FALSE,metric="euclidean", method="complete")
    plot(foodagg)
    #To get the required number of clusters, cut the tree (convert to hclust first)
    cutree(as.hclust(foodagg),k=5)



Happy Learning!!!

February 16, 2016

DBMS Session Three

Relational Model Classification

Key - A subset of attributes
Super Key - Sufficient to identify a tuple uniquely
Considerations for primary key
  • Not-null values
  • Few attributes
  • Key often used in data access clauses
Relational Operators
  • Selection (With Filters Applied)
  • Projection (Select with explicitly specified columns)
  • Cartesian product
  • Union
  • Difference
  • Intersection
  • Join
SQL Refreshers
  • SELECT, WHERE, Aggregate (Group by, Having), JOINs, String operations
  • SET Operations, Handling NULL values, Subqueries
  • DELETE, NOT EXISTS, Conditional Updates
  • Views, Materialized views
  • Authentication & Authorization (Roles & Permissions)
Happy Learning!!!




February 11, 2016

Data Models

Captured Notes from Session #2 - Data Models

Hierarchical Data Models
  • Tree like structures
  • Used in Windows Registry
  • Frequently used (e.g., IBM's IMS)
  • DL/1 - the programming language for IMS
  • Difficult to reorganize
Graph / Network Model
  • Organize collection of records in form of directed graph
  • Three-way relationships can't be maintained
ER Model
  • Defined in terms of Entity, Relationships
  • Never caught on as a physical model
Object Oriented Database model
  • Difficult to map programming objects to database objects
Relational Model
  • Better physical data independence
  • Better logical independence
  • Won because of its relational algebra foundation
Happy Learning!!!

February 01, 2016

World of Data Science

My second semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read / re-read the same kind of definitions / introductory articles on data science before, but Prof. Manish Singh's session gave a whole new analogy and interesting examples to correlate with.

For big data I have always referred back to the 4 Vs: Volume, Veracity, Velocity and Variety. In the same spirit, the definition was presented as
  • Internet of Content - Youtube, Ebooks, Wikipedia, News Feeds
  • Internet of People - Email, Facebook, Linkedin etc.
  • Internet of Things - devices with unique IDs communicating with / managing infrastructure
  • Internet of Location - spatial data related analysis
This "Internet of *" view is a good representation of the different forms / flows of information behind the four Vs.

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – an interdisciplinary field for extracting knowledge from data.

The Data Science workflow involves Data Visualization, Data Analysis, Data Processing and Data Storage tasks. Some of the tools used in each layer are listed below.


Tools available

Data Visualization
Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (Graph partitioning tool)

Data Analysis
Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR

Data Processing

Scheduler – Azkaban, Oozie, Ivory
Cluster Monitoring – (Ganglia + Nagios), Chukwa, Zookeeper

Data Storage
HDFS, HSFTP (HDFS over HTTP), S3, KFS (Kosmos File System)
Data Movement – SQOOP, Flume, Scribe, Kafka, MessageQueue
Columnar Storage – Zebra
Key Value - Hbase

The key ingredients of Data Science are
  • Data Management System
  • Data Mining - the computational process of identifying patterns in large data sets, using techniques at the intersection of multiple disciplines (AI, Stats, Machine Learning, Computer Networks): data classification, clustering, regression, association rule finding and anomaly detection
  • Process Mining - aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)
  • Information Visualization - visualization techniques for large data sets, interactive information visualization, and how to really visualize big data


Databases Vs Data Science
                 Databases                                     Data Science
Data Value       Precious                                      Cheap
Data Volume      Modest                                        Massive
Structured       Strongly (Schema)                             Weakly or none (text)
Priorities       Consistency, Error Recovery, Auditability     Speed, Availability, Query richness
Base             Relational Algebra                            Linear Algebra

PS: My professor had provided references to the examples; I am sharing this post based on notes / slides from my session.  

Happy Learning!!!