"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 31, 2016

March 30, 2016

Good Read - Winning Data Science Competitions






Interviews



Questions From Data Science Interviews

Let your work do the talking
Highly Creative Person - More(['work','attention','energy']) -> ['More Benefit']

Happy Learning!!!

Interesting Read - Data Science & Applications

Happy Reading!!!

March 29, 2016

Good Data Science Tech Talk

Today I spent some time with a tech talk on predictive modelling. Good coverage of the fundamentals; needs another revision.


Read William Chen's answer to What are the most common mistakes made by aspiring data scientists? on Quora

Happy Learning!!!!

March 28, 2016

Data Science Day #12 - Text Processing in R

Today's post covers the basics of text processing in R. We look at removing stop words, numbers, and punctuation, and converting text to lower case.
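A minimal sketch of these cleaning steps in base R (posts in this series typically use the tm package for the same thing, via tm_map with removeNumbers, removePunctuation and stopwords; the sample text and the tiny stop-word list here are illustrative):

```r
text <- "This is Sentence 1, with Numbers & Punctuation!"

text <- tolower(text)                      # lower case conversion
text <- gsub("[0-9]+", "", text)           # remove numbers
text <- gsub("[[:punct:]]", "", text)      # remove punctuation
words <- strsplit(text, "\\s+")[[1]]       # tokenize on whitespace

stop_words <- c("this", "is", "with", "a", "the")   # tiny illustrative list
words <- words[!(words %in% stop_words) & words != ""]
words   # "sentence" "numbers" "punctuation"
```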


Happy Learning!!!

March 27, 2016

Day #11 - Data Science Learning Notes - Evaluating a model

R Square - Goodness of Fit Test
  • R square = 1 - (Sum of Squares of Error / Sum of Squares Total)
  • SST - total sum of squares, the total variation of the dependent variable around its mean
  • SSE - sum of squared errors between actual and predicted values
Adjusted R Square 
  • Adjusted R Square = 1 - ((n-1)/(n-p-1)) * (1 - R Square)
  • p - number of independent variables
  • n - number of records in the dataset
RMSE (Root mean square error)
  • For every record, compute the prediction error
  • Square the errors and take the mean; RMSE is the square root of that mean
  • RMSE should be similar for the training and testing datasets
Bias (Underfit)
  • Model can't explain the dataset
  • R Square value is very low
  • Fix: add more independent variables
Variance
  • RMSE high for the test dataset, RMSE low for the training dataset (overfit)
  • Fix: cut down the number of independent variables
Collinearity Problem
  • Check p-values of the coefficients to test whether the null hypothesis (coefficient = 0) can be rejected
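The metrics above can be computed by hand in R and checked against what lm() reports (the toy data here is illustrative):

```r
# fit a simple linear model and evaluate it
set.seed(42)
x <- 1:20
y <- 3 * x + rnorm(20, sd = 2)
fit  <- lm(y ~ x)
pred <- predict(fit)

sst <- sum((y - mean(y))^2)                 # total sum of squares
sse <- sum((y - pred)^2)                    # sum of squared errors
r2  <- 1 - sse / sst                        # R square
n <- length(y); p <- 1                      # p = number of independent variables
adj_r2 <- 1 - ((n - 1) / (n - p - 1)) * (1 - r2)
rmse <- sqrt(mean((y - pred)^2))            # root mean square error
```

The hand-computed r2 and adj_r2 match summary(fit)$r.squared and summary(fit)$adj.r.squared.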
Next Pending Reads
  • Subset Selection Technique
  • Cross Validation Technique
  • Z test / P Test
Happy Learning!!!

March 22, 2016

Day #10 Data Science Learning - Correlations

Correlation
  • If variables are correlated, you can use machine learning to predict one from the others
  • Mutual relationship or connection between two or more things
  • Correlation shows the interdependence between two variables
  • Measure - how much does one variable change when the other changes?
  • Popularly used - Pearson correlation coefficient
  • Value ranges from -1 to +1
  • Negative correlation (closer to -1) - one value goes up as the other goes down
  • Closer to zero - no correlation
  • Closer to 1 - positive correlation
Correlation - Relationship between two values
Causation - the reason for a change in value (e.g. weight vs. cholesterol, dress size vs. cholesterol). Identify whether a correlation is merely incidental.
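The correlation values above are easy to see with cor() (the simulated data and names are illustrative):

```r
# Pearson correlation with cor()
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)   # strongly, positively related to x
z <- rnorm(100)                      # unrelated to x

cor(x, y)    # close to +1
cor(x, z)    # close to 0
cor(x, -y)   # close to -1
```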

Handling highly correlated variables
  • Removing correlated variables is not strictly required unless correlation = 1 or -1, in which case one of the variables is redundant
  • Perform a PCA
  • Permutation feature importance (Link)
  • Greedy Elimination (GE): iteratively eliminate one feature of the most highly correlated feature pair
  • Recursive Feature Elimination (RFE): recursively eliminate features according to their importance
  • Lasso Regularisation: use L1 regularisation and remove features whose weights are driven to zero
  • Principal Component Analysis (PCA): transform the dataset with PCA and keep the components with the highest variance
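As a sketch of the PCA option, prcomp() on a pair of nearly duplicate features concentrates their shared variance into the first component (the simulated features are illustrative):

```r
# PCA on correlated features with prcomp()
set.seed(2)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.1)   # nearly a duplicate of x1
x3 <- rnorm(50)

pca <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
summary(pca)   # PC1 captures the shared variance of x1 and x2
```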
Ref - Link1, Link2, Link3

Happy Learning!!!

March 20, 2016

Data Science Tip Day #8 - Dealing with Skewness for Error histograms after Linear Regression

In previous posts we have seen histogram-based validation of errors. When a left- or right-skewed distribution is observed, some transformation techniques to apply are
  • Right Skewed - Apply Log
  • Slightly Right - Square root
  • Left Skewed - Exponential
  • Slightly Left - Square root, Cube
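A quick illustration of the first rule: a log transform pulls in the long right tail of right-skewed data (the data and the simple skewness helper here are illustrative):

```r
# right-skewed data before and after a log transform
set.seed(3)
x <- rexp(1000)                      # exponential draws are right skewed

# third standardized moment as a simple skewness measure
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)        # clearly positive (right skew)
skewness(log(x))   # much smaller in magnitude after the transform
```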
Happy Learning!!!

March 19, 2016

Data Science Tip Day#7 - Interaction Variables

This post is about using interaction variables while performing linear regression.

For illustration purposes, let's construct a dataset with three vectors (y, x, z)
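A sketch of such a dataset and model (the true coefficients 1, 2, 3 and 4 are illustrative):

```r
# interaction term: the effect of x on y depends on the level of z
set.seed(4)
x <- rnorm(100)
z <- rnorm(100)
y <- 1 + 2 * x + 3 * z + 4 * x * z + rnorm(100, sd = 0.5)

# y ~ x * z expands to y ~ x + z + x:z, where x:z is the interaction
fit <- lm(y ~ x * z)
coef(fit)   # estimates near 1, 2, 3 and 4
```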



Happy Learning!!!

Data Science Tip Day#6 - Linear Regression, Polynomial Regression


We will be renaming the R tips to Data Science Tips from now on. Today we will look at linear regression, polynomial regression, and variable interactions. Today's stats class was very useful and interesting. Any topic of interest needs practice to master the fundamentals.

Linear Regression
Assume a linear equation y = mx + b
  • Here m is the slope and b is the intercept, the value of y at x = 0
  • Similarly, we use the equation to express the relationship between dependent and independent variables
  • In the equation y = mx + b, y is the dependent variable and x is the independent variable
For illustration purposes, let's construct a dataset with two vectors (y, x)
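A sketch comparing a straight-line fit with a polynomial fit on curved data (the dataset is illustrative):

```r
# linear vs polynomial fit
set.seed(5)
x <- seq(-3, 3, length.out = 100)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(100, sd = 0.3)

lin  <- lm(y ~ x)            # straight line misses the curvature
quad <- lm(y ~ x + I(x^2))   # quadratic term captures it

summary(lin)$r.squared
summary(quad)$r.squared      # higher, since the x^2 term explains more
```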




Happy Learning!!!

March 17, 2016

R Day #5 Tip of the Day - Linear Regression

I have taken up Udemy Data Science classes. Below are notes from the Linear Regression classes.

Linear Regression 
  • Analyzes the relationship between two or more variables
  • The goal is to prepare an equation (outcome y, predictors x)
  • Estimate the value of the dependent variable from the independent variables using the relationship equation
  • Used for continuous variables that have some correlation between them
  • Goodness of fit test to validate the model
Linear Equations
  • Explains relationship between two variables
  •  X (independent - predictor), Y (dependent)
  •  Y = AX + B
  •  A (slope) = change in Y / change in X
  •  B - intercept (value of Y when X = 0)
  •  The equation becomes a predictor of Y
Fitting Line
  • The differences between model and actual values are called residuals
  • Fit so that the sum of squares of the vertical distances (residuals) is minimal
  • Best line = least residual error
Goodness of Fit
  • R square measure 
  • Based on the sum of squared distances (the minimized sum of squared vertical distances)
  • Uses the residual values
  • The higher the R square value, the better the fit (close to 1 is a better fit)
  • Higher correlation means a better fit (R square will also be high)
Multiple Regression
  • Multiple predictors involved (X Values)
  • More than one independent variable used to predict dependent variable
  • Y = A1X1 + A2X2 + A3X3 + ... + ApXp + B
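A sketch of multiple regression with two predictors (the true coefficients 5, 1.5 and -2 are illustrative):

```r
# multiple regression: more than one independent variable
set.seed(6)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 5 + 1.5 * x1 - 2 * x2 + rnorm(100, sd = 0.4)

fit <- lm(y ~ x1 + x2)
coef(fit)   # intercept near 5, slopes near 1.5 and -2
```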

Homoscedasticity - all random variables in the sequence or vector have the same finite variance
Heteroscedasticity - the variability of a variable is unequal across the range of values of a second variable that predicts it
Pearson's correlation coefficient (r) is a measure of the strength of the association between two variables

Happy Learning!!!

March 16, 2016

R Day #4 - Tip for the Day - Learning Resources


I came across this site NYU Stats R Site

Please bookmark it for a good walkthrough of core R topics.

Happy Learning!!!

March 13, 2016

R Day #3 - Tip for the Day

This is based on reading notes from the link.

Logistic Regression
  • Applied when the response is binary
  • (0/1, Yes/No, etc.); also known as a dichotomous outcome variable
Binomial probability model
  • consists of (i) n independent trials where 
  • (ii) each trial results in one of two possible outcomes (Yes/No, 1/0)
  • (iii) the probability p of a success stays the same for each trial
Maximum likelihood - Find the value of the parameter(s) (in this case p) which makes the observed data most likely to have occurred
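In R, glm() with family = binomial fits a logistic regression, and the fitting is done by maximum likelihood (the simulated data and true coefficients 0.5 and 2 are illustrative):

```r
# logistic regression for a binary response
set.seed(7)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))     # true success probability
y <- rbinom(200, size = 1, prob = p)   # binary (0/1) response

fit <- glm(y ~ x, family = binomial)   # fit by maximum likelihood
coef(fit)   # estimates near 0.5 and 2
```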

Poisson Regression
Applied for below situations
  • The occurrences of the event of interest in non-overlapping “time” intervals are independent
  • The probability of two or more events in a small time interval is small, and
  • The probability that an event occurs in a short interval of time is proportional to the length of the time interval
  • Heteroscedasticity - means unequal error variances
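A sketch of Poisson regression for count data with glm() (the simulated data and true coefficients 0.3 and 1.2 are illustrative):

```r
# Poisson regression for counts
set.seed(8)
x <- runif(200, 0, 2)
lambda <- exp(0.3 + 1.2 * x)       # log link between predictor and rate
counts <- rpois(200, lambda)

fit <- glm(counts ~ x, family = poisson)
coef(fit)   # estimates near 0.3 and 1.2
```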
Negative Binomial Model
  • The Poisson model does not always provide a good fit to a count response. 
  • An alternative model is the negative binomial distribution
Happy Learning!!!

March 11, 2016

Day #2 - Multivariate Linear Regression - R

  • More than one predictor involved in this case

Happy Learning!!!

March 10, 2016

R Day #1 - Simple Linear Regression - Slope

UCLA Notes were very useful.

Linear Regression Model - represents the mean of the response variable as a function of slope and intercept parameters. Can be used for predictions. I have earlier used a moving-average algorithm for forecasting.
  • Simple Linear Regression - one explanatory variable
  • Multivariate Linear Regression - more than one explanatory variable
A good summary of data quality issues was also covered
  • Data-entry errors
  • Missing values
  • Outliers
  • Unusual (e.g. asymmetric) distributions
  • Unexpected patterns
R Cookbook had good step by step examples to try out - link

Basics Maths Again

Slope - the line's rate of change in the vertical direction

y = mx + b
  • y = dependent variable, as y depends on x
  • x = independent variable
  • m, b = characteristics of the line
  • m = slope; b = y-intercept, where the line crosses the y-axis
Ref - Link

Slope = Rise / Run
      = Change in y / Change in x

For the line y = x, the points (1, 1) and (2, 2) lie on it:

Slope = (y2 - y1) / (x2 - x1) = (2 - 1) / (2 - 1) = 1

Slope > 1 - the line tilts more steeply toward the y-axis
Slope < 1 - the line tilts more toward the x-axis
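The rise-over-run calculation can be written as a small R helper (the second pair of points is illustrative):

```r
# slope from two points on a line: rise over run
slope <- function(x1, y1, x2, y2) (y2 - y1) / (x2 - x1)

slope(1, 1, 2, 2)   # line y = x has slope 1
slope(0, 3, 2, 7)   # rise of 4 over a run of 2 gives slope 2
```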





Happy Learning!!!

Regression Basics

This post is on the basics of regression and the steps involved. Linear regression defines the relationship between the variables involved; we use it to identify and quantify those relationships.

Steps Involved
  • Plot the independent variable on the X axis and the dependent variable on the Y axis
  • Identify whether the relationship is positive or negative (when Y increases as X increases, it is positive)
  • Fit a line that minimizes the errors between estimates and actuals
Y = B0 + B1X (B0, B1 Derived Mathematically)
where B0 is Y Intercept, B1 is Slope
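B0 and B1 can be derived by least squares and checked against lm() (the five data points are illustrative):

```r
# least-squares estimates of intercept B0 and slope B1, by hand
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.0, 9.8)

B1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
B0 <- mean(y) - B1 * mean(x)                                     # Y intercept

fit <- lm(y ~ x)
coef(fit)   # matches B0 and B1
```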

R Squared 

R Squared Verification 
  • How well the regression line predicts actual values
  • Take the actual values and compute their mean; deviations of the actual values from the mean sum to zero
  • A perfect fit has R square equal to 1


Standard Error of Estimates
  • Compare estimated values vs Actual Values
  • Distance between estimated and actual values

Correlation Coefficient
  • Fit the line
  • Remember slope +ve or -ve
  • Scatter along Y and X Axis
  • High Correlation means good fit

In the next post we will look at R examples.

Happy Learning!!!