March 30, 2016
Good Read - Winning Data Science Competitions
Interviews
Questions From Data Science Interviews
Let your work do the talking
Highly Creative Person - More(['work','attention','energy']) -> ['More Benefit']
Interesting Read - Data Science & Applications
Happy Reading!!!
A @Kaggle winner discusses his #MachineLearning secrets: https://t.co/OCIhKyuBRK #abdsc #BigData #DataScience pic.twitter.com/92dVvnaa13
— Kirk Borne (@KirkDBorne) March 21, 2016
Labels:
Data Science Tips
March 29, 2016
Good Data Science Tech Talk
Today I spent some time with a tech talk on predictive modelling. Good coverage of the fundamentals; needs revisiting again.
Read William Chen's answer to What are the most common mistakes made by aspiring data scientists? on Quora
Happy Learning!!!!
Labels:
Data Science Tips
March 28, 2016
Data Science Day #12 - Text Processing in R
Today's post covers the basics of text processing using R: removing stop words, numbers and punctuation, lower case conversion, etc. A short sketch of these steps follows.
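This is a minimal sketch, assuming the tm package (my choice here, not named in the post) is installed:

library(tm)  # assumed available via install.packages("tm")

docs <- c("The 3 QUICK brown foxes, jumped over 2 lazy dogs!",
          "R makes text processing easy; punctuation, CASE and numbers included.")
corpus <- Corpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))   # lower case conversion
corpus <- tm_map(corpus, removeNumbers)                  # remove numbers
corpus <- tm_map(corpus, removePunctuation)              # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # remove English stop words
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra spaces

inspect(corpus)                                          # view the cleaned text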
Happy Learning!!!
Labels:
Data Science Tips
March 27, 2016
Day #11 - Data Science Learning Notes - Evaluating a model
R Square - Goodness of Fit Test
- R Square = 1 - (Sum of Squares of Error / Sum of Squares of Total)
- SST - total sum of squares, the variation of the dependent variable about its mean
- SSE - sum of squared errors between actual and predicted values
- Adjusted R Square = 1 - ((n - 1) / (n - p - 1)) * (1 - R Square)
- p - number of independent variables
- n - number of records in the dataset
RMSE - Root Mean Squared Error
- For every record predicted, compute the error
- Square the errors and find their mean
- Take the square root of that mean
- RMSE should be similar for the training and test datasets
Diagnosing and improving the model
- Model can't explain the dataset (R Square value is very low) - add more independent variables
- RMSE high for the test dataset but low for the training dataset (overfitting) - cut down independent variables
- Conduct a p-test to check whether the null hypothesis holds
- Subset Selection Technique
- Cross Validation Technique
- Z test / P test
A short R sketch of these measures follows.
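This is a minimal sketch using the built-in mtcars dataset, chosen purely for illustration:

model <- lm(mpg ~ wt + hp, data = mtcars)

actual    <- mtcars$mpg
predicted <- predict(model, mtcars)

sse <- sum((actual - predicted)^2)        # sum of squares of error
sst <- sum((actual - mean(actual))^2)     # sum of squares total
r_square <- 1 - sse / sst

n <- nrow(mtcars)                         # records in dataset
p <- 2                                    # independent variables (wt, hp)
adj_r_square <- 1 - ((n - 1) / (n - p - 1)) * (1 - r_square)

rmse <- sqrt(mean((actual - predicted)^2))  # root mean squared error
c(r_square, adj_r_square, rmse)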
Labels:
Data Science Tips
March 22, 2016
Day #10 Data Science Learning - Correlations
Correlation
- If variables are correlated, you can use machine learning to predict one from the other
- Correlation is a mutual relationship or connection between two or more things
- Correlation shows the interdependence between two variables
- Measure - how much does one variable change when the other changes?
- Popularly used - Pearson correlation coefficient
- Value ranges from -1 to +1
- Negative correlation (closer to -1) - one value goes up as the other goes down
- Closer to zero - no correlation
- Closer to 1 - positive correlation
Causation - the reason for a change in value (e.g. cholesterol vs weight, dress size vs cholesterol). Identify whether an observed relationship is merely incidental.
Handling highly correlated variables (a short R sketch follows this list)
- Remove one of two highly correlated variables; when the correlation is exactly 1 or -1, one of the variables is completely redundant
- Perform a PCA
- Permutation feature importance (Link)
- Greedy Elimination (GE): iteratively eliminate one feature of the most highly correlated feature pair
- Recursive Feature Elimination (RFE): recursively eliminate features with respect to their importance
- Lasso Regularisation (LR): use L1 regularisation and remove features whose weights are driven to zero
- Principal Component Analysis (PCA): transform the dataset with PCA and choose the components with the highest variance
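This is a minimal sketch of spotting highly correlated pairs, using the built-in mtcars dataset and an illustrative 0.9 cut-off:

corr_matrix <- cor(mtcars, method = "pearson")   # Pearson correlation coefficients

threshold <- 0.9                                 # illustrative cut-off
high <- which(abs(corr_matrix) > threshold & upper.tri(corr_matrix), arr.ind = TRUE)

# List the flagged variable pairs with their correlations
data.frame(var1 = rownames(corr_matrix)[high[, 1]],
           var2 = colnames(corr_matrix)[high[, 2]],
           corr = corr_matrix[high])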
Labels:
Correlation,
Data Science Tips
March 20, 2016
Data Science Tip Day #8 - Dealing with Skewness for Error histograms after Linear Regression
In previous posts we looked at histogram-based validation of errors. When a left- or right-skewed distribution is observed, some transformations to apply are listed below, with a short sketch after the list:
- Right Skewed - Apply Log
- Slightly Right - Square root
- Left Skewed - Exponential
- Slightly Left - Square, Cube
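A minimal sketch showing the log transform reducing right skew; the sample is synthetic (log-normal), generated purely for illustration:

set.seed(42)
x <- rlnorm(1000)                  # right-skewed sample

par(mfrow = c(1, 2))               # two histograms side by side
hist(x, main = "Right skewed", xlab = "x")
hist(log(x), main = "After log transform", xlab = "log(x)")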
Labels:
Data Science Tips
March 19, 2016
Data Science Tip Day#7 - Interaction Variables
This post is on using interaction variables when performing linear regression.
For illustration purposes, let's construct a dataset with three vectors (y, x, z), as sketched below.
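This is a minimal sketch with synthetic data; the coefficients are arbitrary, chosen only for illustration:

set.seed(7)
x <- runif(100)
z <- runif(100)
y <- 2 + 3 * x + 1.5 * z + 4 * x * z + rnorm(100, sd = 0.5)

# In R's formula syntax, x * z expands to x + z + x:z (the interaction term)
model <- lm(y ~ x * z)
summary(model)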
Happy Learning!!!
Labels:
Data Science Tips
Data Science Tip Day#6 - Linear Regression, Polynomial Regression
We will be renaming the R tips to Data Science Tips going forward. Today we will look at Linear Regression, Polynomial Regression & Variable Interactions; a short sketch follows the notes below. Today's stats class was very useful and interesting. Any topic of interest needs practice to master the fundamentals.
Linear Regression
Assume a linear equation y = mx + b
- Here m is the slope and b is the intercept (the value of y at x = 0)
- Similarly, we use the equation to express the relationship between the dependent and independent variables
- In this equation y = mx + b, y is the dependent variable and x is the independent variable
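A minimal sketch fitting a linear and a polynomial model on the built-in cars dataset, chosen purely for illustration:

linear_fit <- lm(dist ~ speed, data = cars)           # y = mx + b form
poly_fit   <- lm(dist ~ poly(speed, 2), data = cars)  # adds a quadratic term

summary(linear_fit)$r.squared                         # compare goodness of fit
summary(poly_fit)$r.squared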
Happy Learning!!!
Labels:
Data Science Tips
March 17, 2016
R Day #5 Tip of the Day - Linear Regression
I have taken up the Udemy Data Science classes. Below are notes from the Linear Regression classes, with a short R sketch at the end.
Linear Regression
- Analyze relationship between two or multiple variables
- Goal is preparing equation (Outcome y, Predictors x)
- Estimate the value of the dependent variable from the independent variables using the relationship equation
- Used for Continuous variables that have some correlation between them
- Goodness of fit test to validate model
- Explains relationship between two variables
- X (Independent- Predictor), Y (Dependent)
- Y = AX + B
- A (Slope) = change in Y / change in X
- B - Intercept (Value of Y when X =0)
- Equation becomes predictor of Y
- Sum of squares of vertical distances minimal
- Best Line = least residual
- Differences between model predictions and actual values are called residuals
- R square measure
- Sum of squares of distances (Sum of squares of vertical distances minimal)
- Uses residual values
- Higher R square value better the fit (close to 1 higher fit)
- Higher Correlation means better fit (R square will also be high)
- Multiple predictors involved (X Values)
- More than one independent variable used to predict dependent variable
- Y = A1X1 + A2X2 + A3X3 + ... + ApXp + B
Homoscedasticity - all random variables in the sequence or vector have the same finite variance
Heteroscedasticity - variability of a variable is unequal across the range of values of a second variable that predicts it
Pearson's correlation coefficient (r) is a measure of the strength of the association between the two variables
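A minimal sketch tying these notes together, on the built-in mtcars dataset:

model <- lm(mpg ~ wt, data = mtcars)   # Y = AX + B
coef(model)                            # B (intercept) and A (slope)
head(residuals(model))                 # differences between actual and fitted values
summary(model)$r.squared               # R square measure
cor(mtcars$wt, mtcars$mpg)             # Pearson's correlation coefficient (r)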
Happy Learning!!!
Labels:
R Tips
March 16, 2016
R Day #4 - Tip for the Day - Learning Resources
I came across this site NYU Stats R Site
Please bookmark it for a good walkthrough of core R topics
Happy Learning!!!
Labels:
R Tips
March 15, 2016
Good Reads - Data Science
Read Joaquin Quiñonero Candela's answer to In applied Machine Learning what is more important: data, infrastructure, or algorithms? on Quora
Read Joaquin Quiñonero Candela's answer to What do you look for when hiring someone for your team? on Quora
Happy Learning!!!
Labels:
Good Reads
March 13, 2016
R Day #3 - Tip for the Day
This post is based on notes from this link; a short R sketch of both models follows the notes.
Logistic Regression
- Applied when the response is binary
- (0/1, Yes/No, etc.), also known as a dichotomous outcome variable
- The setting consists of (i) n independent trials, where
- (ii) each trial results in one of two possible outcomes (Yes/No, 1/0), and
- (iii) the probability p of a success stays the same for each trial
Poisson Regression
Applied for below situations
- The occurrences of the event of interest in non-overlapping “time” intervals are independent
- The probability of two or more events in a small time interval is small, and
- The probability that an event occurs in a short interval of time is proportional to the length of the time interval
- Heteroscedasticity - means unequal error variances
- The Poisson model does not always provide a good fit to a count response.
- An alternative model is the negative binomial distribution
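A minimal sketch of both models with glm(); the binary response uses the built-in mtcars dataset and the count data is synthetic, both purely for illustration:

# Logistic regression: binary response (am is the 0/1 transmission type)
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_model)

# Poisson regression on a synthetic count response
set.seed(3)
exposure <- runif(50, 1, 10)
counts <- rpois(50, lambda = exposure)   # counts grow with exposure
pois_model <- glm(counts ~ exposure, family = poisson)
summary(pois_model)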
Labels:
R Tips
March 11, 2016
Day #2 - Multivariate Linear Regression - R
- More than one predictor is involved in this case; a short sketch follows
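A minimal sketch using the built-in trees dataset, chosen purely for illustration:

# Two predictors (Girth, Height) for one response (Volume)
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)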
Happy Learning!!!
Labels:
R Tips
March 10, 2016
R Day #1 - Simple Linear Regression - Slope
UCLA Notes were very useful.
Linear Regression Model - Representing mean of response variable as function using slope and intercept parameters. Can be used for predictions. I have earlier used moving average algorithm for forecasting.
- Simple Linear Regression - one explanatory (independent) variable
- Multivariate Linear Regression - more than one explanatory variable
Before fitting a model, check the data for:
- Data-entry errors
- Missing values
- Outliers
- Unusual (e.g. asymmetric) distributions
- Unexpected patterns
Basic Maths Again
Slope - a line's rate of change in the vertical direction
y = mx + b
- y = dependent variable, as y depends on x
- x = independent variable
- m, b = characteristics of the line
- b = y intercept, where the line crosses the y axis
Ref - Link
Slope = Rise / Run = Change in y / Change in x
For the equation y = x, take the points (1, 1) and (2, 2):
Slope = (y2 - y1) / (x2 - x1) = (2 - 1) / (2 - 1) = 1
- Slope > 1 - the line tilts towards the y axis (steeper)
- Slope < 1 - the line tilts towards the x axis (flatter)
A short R sketch tying slope and intercept to a fitted regression follows.
Ref - Link
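This is a minimal sketch relating slope and intercept to a fitted simple regression, using the built-in cars dataset:

model <- lm(dist ~ speed, data = cars)
coef(model)                       # b (intercept) and m (slope)

# Slope as rise over run: change in predicted y for a unit change in x
predicted <- predict(model, data.frame(speed = c(10, 11)))
diff(predicted)                   # equals the slope coefficient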
Labels:
Forecasting,
R Tips
Regression Basics
This post covers the basics of regression and the steps involved. Linear regression models the relationship between the variables involved, and we use it to identify and quantify those relationships.
Steps Involved
- Plot the independent variable on the X axis and the dependent variable on the Y axis
- Identify whether the relationship is positive or negative (when Y increases as X increases, it is positive)
- Fit a line that minimizes the errors between estimates and actuals
Y = B0 + B1X, where B0 is the Y intercept and B1 is the slope
R Squared Verification
- How well the regression line predicts the actual values
- Take the actual values and compute their mean; the distances of the actual values from the mean sum to zero
- A perfect fit gives an R square equal to 1
Standard Error of Estimates
- Compare estimated values vs actual values
- Distance between estimated and actual values
Correlation Coefficient
- Fit the line
- Its sign matches the slope (+ve or -ve)
- Measures the scatter along the Y and X axes
- High correlation means a good fit
In the next post we will look at R examples; meanwhile, a quick sketch of the steps above follows.
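This is a minimal sketch using the built-in faithful dataset, chosen purely for illustration:

model <- lm(eruptions ~ waiting, data = faithful)   # fit the line

plot(faithful$waiting, faithful$eruptions,
     xlab = "Waiting (independent, X axis)",
     ylab = "Eruptions (dependent, Y axis)")
abline(model)                             # line minimizing squared errors

coef(model)                               # B0 (intercept) and B1 (slope)
summary(model)$r.squared                  # R squared
cor(faithful$waiting, faithful$eruptions) # correlation coefficient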
Happy Learning!!!
Labels:
R