March 30, 2016
Good Read - Winning Data Science Competitions
Interviews
Questions From Data Science Interviews
Let your work do the talking
Highly Creative Person - More(['work','attention','energy']) -> ['More Benefit']
Interesting Read - Data Science & Applications
Happy Reading!!!
A @Kaggle winner discusses his #MachineLearning secrets: https://t.co/OCIhKyuBRK #abdsc #BigData #DataScience pic.twitter.com/92dVvnaa13
— Kirk Borne (@KirkDBorne) March 21, 2016
Labels:
Data Science Tips
March 29, 2016
Good Data Science Tech Talk
Today I spent some time with a tech talk on predictive modelling. Good coverage of the fundamentals; needs revisiting again.
Read William Chen's answer to What are the most common mistakes made by aspiring data scientists? on Quora
Happy Learning!!!!
Labels:
Data Science Tips
March 28, 2016
Data Science Day #12 - Text Processing in R
Today's post covers the basics of text processing using R: removing stop words, numbers and punctuation, lower case conversion, etc. A short sketch of these steps follows.
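This is a minimal sketch, assuming the tm package (my choice here, not named in the post) is installed:

library(tm)  # assumed available via install.packages("tm")

docs <- c("The 3 QUICK brown foxes, jumped over 2 lazy dogs!",
          "R makes text processing easy; punctuation, CASE and numbers included.")
corpus <- Corpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))   # lower case conversion
corpus <- tm_map(corpus, removeNumbers)                  # remove numbers
corpus <- tm_map(corpus, removePunctuation)              # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # remove English stop words
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra spaces

inspect(corpus)                                          # view the cleaned text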
Happy Learning!!!
Labels:
Data Science Tips
March 27, 2016
Day #11 - Data Science Learning Notes - Evaluating a model
R Square - Goodness of Fit Test
- R Square = 1 - (Sum of Squares of Error / Sum of Squares of Total)
- SST - total sum of squares, the variation of the dependent variable about its mean
- SSE - sum of squared errors between actual and predicted values
- Adjusted R Square = 1 - ((n - 1) / (n - p - 1)) * (1 - R Square)
- p - number of independent variables
- n - number of records in the dataset
RMSE - Root Mean Squared Error
- For every record predicted, compute the error
- Square the errors and find their mean
- Take the square root of that mean
- RMSE should be similar for the training and test datasets
Diagnosing and improving the model
- Model can't explain the dataset (R Square value is very low) - add more independent variables
- RMSE high for the test dataset but low for the training dataset (overfitting) - cut down independent variables
- Conduct a p-test to check whether the null hypothesis holds
- Subset Selection Technique
- Cross Validation Technique
- Z test / P test
A short R sketch of these measures follows.
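This is a minimal sketch using the built-in mtcars dataset, chosen purely for illustration:

model <- lm(mpg ~ wt + hp, data = mtcars)

actual    <- mtcars$mpg
predicted <- predict(model, mtcars)

sse <- sum((actual - predicted)^2)        # sum of squares of error
sst <- sum((actual - mean(actual))^2)     # sum of squares total
r_square <- 1 - sse / sst

n <- nrow(mtcars)                         # records in dataset
p <- 2                                    # independent variables (wt, hp)
adj_r_square <- 1 - ((n - 1) / (n - p - 1)) * (1 - r_square)

rmse <- sqrt(mean((actual - predicted)^2))  # root mean squared error
c(r_square, adj_r_square, rmse)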
Labels:
Data Science Tips
March 22, 2016
Day #10 Data Science Learning - Correlations
Correlation
- If variables are correlated, you can use machine learning to predict one from the other
- Correlation is a mutual relationship or connection between two or more things
- Correlation shows the interdependence between two variables
- Measure - how much does one variable change when the other changes?
- Popularly used - Pearson correlation coefficient
- Value ranges from -1 to +1
- Negative correlation (closer to -1) - one value goes up as the other goes down
- Closer to zero - no correlation
- Closer to 1 - positive correlation
Causation - the reason for a change in value (e.g. cholesterol vs weight, dress size vs cholesterol). Identify whether an observed relationship is merely incidental.
Handling highly correlated variables (a short R sketch follows this list)
- Remove one of two highly correlated variables; when the correlation is exactly 1 or -1, one of the variables is completely redundant
- Perform a PCA
- Permutation feature importance (Link)
- Greedy Elimination (GE): iteratively eliminate one feature of the most highly correlated feature pair
- Recursive Feature Elimination (RFE): recursively eliminate features with respect to their importance
- Lasso Regularisation (LR): use L1 regularisation and remove features whose weights are driven to zero
- Principal Component Analysis (PCA): transform the dataset with PCA and choose the components with the highest variance
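This is a minimal sketch of spotting highly correlated pairs, using the built-in mtcars dataset and an illustrative 0.9 cut-off:

corr_matrix <- cor(mtcars, method = "pearson")   # Pearson correlation coefficients

threshold <- 0.9                                 # illustrative cut-off
high <- which(abs(corr_matrix) > threshold & upper.tri(corr_matrix), arr.ind = TRUE)

# List the flagged variable pairs with their correlations
data.frame(var1 = rownames(corr_matrix)[high[, 1]],
           var2 = colnames(corr_matrix)[high[, 2]],
           corr = corr_matrix[high])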
Labels:
Correlation,
Data Science Tips
March 20, 2016
Data Science Tip Day #8 - Dealing with Skewness for Error histograms after Linear Regression
In previous posts we looked at histogram-based validation of errors. When a left- or right-skewed distribution is observed, some transformations to apply are listed below, with a short sketch after the list:
- Right Skewed - Apply Log
- Slightly Right - Square root
- Left Skewed - Exponential
- Slightly Left - Square, Cube
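A minimal sketch showing the log transform reducing right skew; the sample is synthetic (log-normal), generated purely for illustration:

set.seed(42)
x <- rlnorm(1000)                  # right-skewed sample

par(mfrow = c(1, 2))               # two histograms side by side
hist(x, main = "Right skewed", xlab = "x")
hist(log(x), main = "After log transform", xlab = "log(x)")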
Labels:
Data Science Tips
March 19, 2016
Data Science Tip Day#7 - Interaction Variables
This post is on using interaction variables when performing linear regression.
For illustration purposes, let's construct a dataset with three vectors (y, x, z), as sketched below.
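This is a minimal sketch with synthetic data; the coefficients are arbitrary, chosen only for illustration:

set.seed(7)
x <- runif(100)
z <- runif(100)
y <- 2 + 3 * x + 1.5 * z + 4 * x * z + rnorm(100, sd = 0.5)

# In R's formula syntax, x * z expands to x + z + x:z (the interaction term)
model <- lm(y ~ x * z)
summary(model)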
Happy Learning!!!
Labels:
Data Science Tips
Data Science Tip Day#6 - Linear Regression, Polynomial Regression
We will be renaming the R tips to Data Science Tips going forward. Today we will look at Linear Regression, Polynomial Regression & Variable Interactions; a short sketch follows the notes below. Today's stats class was very useful and interesting. Any topic of interest needs practice to master the fundamentals.
Linear Regression
Assume a linear equation y = mx + b
- Here m is the slope and b is the intercept (the value of y at x = 0)
- Similarly, we use the equation to express the relationship between the dependent and independent variables
- In this equation y = mx + b, y is the dependent variable and x is the independent variable
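A minimal sketch fitting a linear and a polynomial model on the built-in cars dataset, chosen purely for illustration:

linear_fit <- lm(dist ~ speed, data = cars)           # y = mx + b form
poly_fit   <- lm(dist ~ poly(speed, 2), data = cars)  # adds a quadratic term

summary(linear_fit)$r.squared                         # compare goodness of fit
summary(poly_fit)$r.squared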
Happy Learning!!!
Labels:
Data Science Tips
March 17, 2016
R Day #5 Tip of the Day - Linear Regression
I have taken up the Udemy Data Science classes. Below are notes from the Linear Regression classes, with a short R sketch at the end.
Linear Regression
- Analyze relationship between two or multiple variables
- Goal is preparing equation (Outcome y, Predictors x)
- Estimate the value of the dependent variable from the independent variables using the relationship equation
- Used for Continuous variables that have some correlation between them
- Goodness of fit test to validate model
- Explains relationship between two variables
- X (Independent- Predictor), Y (Dependent)
- Y = AX + B
- A (Slope) = change in Y / change in X
- B - Intercept (Value of Y when X =0)
- Equation becomes predictor of Y
- Sum of squares of vertical distances minimal
- Best Line = least residual
- Differences between model predictions and actual values are called residuals
- R square measure
- Sum of squares of distances (Sum of squares of vertical distances minimal)
- Uses residual values
- Higher R square value better the fit (close to 1 higher fit)
- Higher Correlation means better fit (R square will also be high)
- Multiple predictors involved (X Values)
- More than one independent variable used to predict dependent variable
- Y = A1X1 + A2X2 + A3X3 + ... + ApXp + B
Homoscedasticity - all random variables in the sequence or vector have the same finite variance
Heteroscedasticity - variability of a variable is unequal across the range of values of a second variable that predicts it
Pearson's correlation coefficient (r) is a measure of the strength of the association between the two variables
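A minimal sketch tying these notes together, on the built-in mtcars dataset:

model <- lm(mpg ~ wt, data = mtcars)   # Y = AX + B
coef(model)                            # B (intercept) and A (slope)
head(residuals(model))                 # differences between actual and fitted values
summary(model)$r.squared               # R square measure
cor(mtcars$wt, mtcars$mpg)             # Pearson's correlation coefficient (r)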
Happy Learning!!!
Labels:
R Tips
March 16, 2016
R Day #4 - Tip for the Day - Learning Resources
I came across this site NYU Stats R Site
Please bookmark it for a good walkthrough of core R topics
Happy Learning!!!
Labels:
R Tips
March 15, 2016
Good Reads - Data Science
Read Joaquin Quiñonero Candela's answer to In applied Machine Learning what is more important: data, infrastructure, or algorithms? on Quora
Read Joaquin Quiñonero Candela's answer to What do you look for when hiring someone for your team? on Quora
Happy Learning!!!
Labels:
Good Reads
March 13, 2016
R Day #3 - Tip for the Day
This post is based on notes from this link; a short R sketch of both models follows the notes.
Logistic Regression
- Applied when the response is binary
- (0/1, Yes/No, etc.), also known as a dichotomous outcome variable
- The setting consists of (i) n independent trials, where
- (ii) each trial results in one of two possible outcomes (Yes/No, 1/0), and
- (iii) the probability p of a success stays the same for each trial
Poisson Regression
Applied for below situations
- The occurrences of the event of interest in non-overlapping “time” intervals are independent
- The probability of two or more events in a small time interval is small, and
- The probability that an event occurs in a short interval of time is proportional to the length of the time interval
- Heteroscedasticity - means unequal error variances
- The Poisson model does not always provide a good fit to a count response.
- An alternative model is the negative binomial distribution
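A minimal sketch of both models with glm(); the binary response uses the built-in mtcars dataset and the count data is synthetic, both purely for illustration:

# Logistic regression: binary response (am is the 0/1 transmission type)
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_model)

# Poisson regression on a synthetic count response
set.seed(3)
exposure <- runif(50, 1, 10)
counts <- rpois(50, lambda = exposure)   # counts grow with exposure
pois_model <- glm(counts ~ exposure, family = poisson)
summary(pois_model)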
Labels:
R Tips
March 11, 2016
Day #2 - Multivariate Linear Regression - R
- More than one predictor is involved in this case; a short sketch follows
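A minimal sketch using the built-in trees dataset, chosen purely for illustration:

# Two predictors (Girth, Height) for one response (Volume)
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)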
Happy Learning!!!
Labels:
R Tips
March 10, 2016
R Day #1 - Simple Linear Regression - Slope
UCLA Notes were very useful.
Linear Regression Model - Representing mean of response variable as function using slope and intercept parameters. Can be used for predictions. I have earlier used moving average algorithm for forecasting.
- Simple Linear Regression - one explanatory (independent) variable
- Multivariate Linear Regression - more than one explanatory variable
Before fitting a model, check the data for:
- Data-entry errors
- Missing values
- Outliers
- Unusual (e.g. asymmetric) distributions
- Unexpected patterns
Basic Maths Again
Slope - a line's rate of change in the vertical direction
y = mx + b
- y = dependent variable, as y depends on x
- x = independent variable
- m, b = characteristics of the line
- b = y intercept, where the line crosses the y axis
Ref - Link
Slope = Rise / Run = Change in y / Change in x
For the equation y = x, take the points (1, 1) and (2, 2):
Slope = (y2 - y1) / (x2 - x1) = (2 - 1) / (2 - 1) = 1
- Slope > 1 - the line tilts towards the y axis (steeper)
- Slope < 1 - the line tilts towards the x axis (flatter)
A short R sketch tying slope and intercept to a fitted regression follows.
Ref - Link
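This is a minimal sketch relating slope and intercept to a fitted simple regression, using the built-in cars dataset:

model <- lm(dist ~ speed, data = cars)
coef(model)                       # b (intercept) and m (slope)

# Slope as rise over run: change in predicted y for a unit change in x
predicted <- predict(model, data.frame(speed = c(10, 11)))
diff(predicted)                   # equals the slope coefficient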
Labels:
Forecasting,
R Tips
Regression Basics
This post covers the basics of regression and the steps involved. Linear regression models the relationship between the variables involved, and we use it to identify and quantify those relationships.
Steps Involved
- Plot the independent variable on the X axis and the dependent variable on the Y axis
- Identify whether the relationship is positive or negative (when Y increases as X increases, it is positive)
- Fit a line that minimizes the errors between estimates and actuals
Y = B0 + B1X, where B0 is the Y intercept and B1 is the slope
R Squared Verification
- How well the regression line predicts the actual values
- Take the actual values and compute their mean; the distances of the actual values from the mean sum to zero
- A perfect fit gives an R square equal to 1
Standard Error of Estimates
- Compare estimated values vs actual values
- Distance between estimated and actual values
Correlation Coefficient
- Fit the line
- Its sign matches the slope (+ve or -ve)
- Measures the scatter along the Y and X axes
- High correlation means a good fit
In the next post we will look at R examples; meanwhile, a quick sketch of the steps above follows.
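This is a minimal sketch using the built-in faithful dataset, chosen purely for illustration:

model <- lm(eruptions ~ waiting, data = faithful)   # fit the line

plot(faithful$waiting, faithful$eruptions,
     xlab = "Waiting (independent, X axis)",
     ylab = "Eruptions (dependent, Y axis)")
abline(model)                             # line minimizing squared errors

coef(model)                               # B0 (intercept) and B1 (slope)
summary(model)$r.squared                  # R squared
cor(faithful$waiting, faithful$eruptions) # correlation coefficient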
Happy Learning!!!
Labels:
R