"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;
Showing posts with label R. Show all posts

March 10, 2016

Regression Basics

This post covers the basics of regression and the steps involved. Linear regression models the relationship between a dependent variable and one or more independent variables.

Steps Involved
  • Plot the data with the independent variable on the X axis and the dependent variable on the Y axis
  • Identify whether the relationship is positive or negative (it is positive when Y increases as X increases)
  • Fit the line that minimizes the errors between estimated and actual values
Y = B0 + B1X (B0 and B1 are derived mathematically)
where B0 is the Y intercept and B1 is the slope
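The steps above can be sketched in R with lm(). This is a minimal illustration on made-up numbers; the vectors x and y and the names b0 and b1 are assumptions for the example, not data from the post.

```r
# Hypothetical data: x is the independent variable, y the dependent variable
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Fit Y = B0 + B1 * X by least squares
fit <- lm(y ~ x)
b0 <- coef(fit)[["(Intercept)"]]  # Y intercept
b1 <- coef(fit)[["x"]]            # slope (positive here, so a positive relationship)

# Scatter plot with the fitted line (only drawn in an interactive session)
if (interactive()) {
  plot(x, y)
  abline(fit)
}
```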

R Squared 

R Squared Verification 
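  • Measures how well the regression line predicts the actual values
  • Take the actual values and compute their mean. The deviations of the actual values from the mean sum to zero, so the squared deviations are used to measure the total variation
  • R squared is the share of that total variation explained by the line; for a perfect fit it equals 1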
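A small sketch of the R squared check, again on made-up numbers (x, y and the variable names are assumptions for illustration). It computes R squared by hand and compares it with the value reported by summary().

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

# Deviations of the actual values from their mean sum to zero
dev_from_mean <- y - mean(y)
sum_of_deviations <- sum(dev_from_mean)

ss_total    <- sum(dev_from_mean^2)    # total variation around the mean
ss_residual <- sum(residuals(fit)^2)   # variation left unexplained by the line
r_squared   <- 1 - ss_residual / ss_total

# The same quantity as reported by summary()
r_squared_lm <- summary(fit)$r.squared
```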


Standard Error of Estimates
  • Compares estimated values with actual values
  • Measures the typical distance between estimated and actual values
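As a rough sketch, the standard error of the estimate for a simple regression can be computed from the residuals. The data are made up, and the n - 2 divisor assumes one predictor plus an intercept.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
fit <- lm(y ~ x)

estimated <- fitted(fit)       # values predicted by the line
errors    <- y - estimated     # distance between actual and estimated values

# Standard error of the estimate: root of the squared errors over n - 2
se_estimate <- sqrt(sum(errors^2) / (length(y) - 2))

# lm() reports the same quantity as the residual standard error
se_from_lm <- summary(fit)$sigma
```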

Correlation Coefficient
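  • Fit the line
  • Note whether the slope is positive or negative; the correlation coefficient has the same sign
  • Look at how tightly the points scatter around the line
  • A correlation close to +1 or -1 means a good fit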
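A quick sketch with cor() on made-up data shows the link between the correlation coefficient, the sign of the slope and R squared. The vectors are assumptions for illustration.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r     <- cor(x, y)                 # Pearson correlation coefficient
fit   <- lm(y ~ x)
slope <- coef(fit)[["x"]]

# For a one-predictor regression, r has the sign of the slope
# and r squared equals R squared
same_sign <- sign(r) == sign(slope)
r_sq      <- r^2
```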

In the next post we will look at R examples

Happy Learning!!!

February 29, 2016

Naive Bayes Classifier

Naive Bayes Classifier Notes and Examples

  • Works on the assumption that the occurrence of word i does not depend on the occurrence of word i+1
  • In practice a sentence has context only when words occur with appropriate terms and positions, so the independence assumption is a simplification
  • As an example, two classes and a test document to classify are listed below
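A minimal base R sketch of the idea follows. The training documents, class labels and test document are hypothetical and are not the ones from the original example; the add-one (Laplace) smoothing is also an assumption of this sketch.

```r
# Hypothetical training documents (not the ones from the original post)
train <- data.frame(
  text  = c("chinese beijing chinese", "chinese chinese shanghai",
            "chinese macao", "tokyo japan chinese"),
  class = c("china", "china", "china", "japan"),
  stringsAsFactors = FALSE
)
test_doc <- "chinese chinese chinese tokyo japan"

tokenize <- function(s) strsplit(s, " ", fixed = TRUE)[[1]]
vocab <- unique(unlist(lapply(train$text, tokenize)))

# Log posterior (up to a constant) of one class for a tokenized document
log_score <- function(cls, words) {
  in_class    <- train$class == cls
  class_words <- unlist(lapply(train$text[in_class], tokenize))
  log_prior   <- log(mean(in_class))
  # Add-one smoothed word likelihoods; each word is treated as independent
  log_lik <- sapply(words, function(w) {
    log((sum(class_words == w) + 1) / (length(class_words) + length(vocab)))
  })
  log_prior + sum(log_lik)
}

classes   <- unique(train$class)
scores    <- sapply(classes, log_score, words = tokenize(test_doc))
predicted <- classes[which.max(scores)]
```

Working in logs avoids multiplying many small probabilities together, which matters once documents get longer than this toy case.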

Ref - Link

Happy Learning!!!

February 23, 2016

Hierarchical Clustering


  • Start with every point as its own cluster and compute the distance between every pair of clusters
  • Merge the nearest pair, repeating until the number of clusters equals the number of clusters needed
  • The entire merge sequence can be represented as a dendrogram
  • At the end of the algorithm the dendrogram is plotted
Measuring Distance between clusters
  • Single (minimum distance between two points, one from each cluster)
  • Complete (maximum distance between two points, one from each cluster)
  • Average (average distance over all possible pairs of points, one from each cluster)
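The procedure and the three linkage rules can be tried with hclust() from base R. The points below are made up for illustration, and cutting at k = 2 is an arbitrary choice.

```r
# Hypothetical 2-D points forming two loose groups
pts <- matrix(c(1.0, 1.0,
                1.5, 1.2,
                1.2, 0.8,
                8.0, 8.0,
                8.5, 8.3,
                7.8, 8.1),
              ncol = 2, byrow = TRUE)

d <- dist(pts)  # Euclidean distance between every pair of points

# Agglomerative clustering with the three linkage rules
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Cut the tree at the required number of clusters (2 here)
groups <- cutree(hc_average, k = 2)

# Plot the dendrogram (only drawn in an interactive session)
if (interactive()) plot(hc_average)
```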

Happy Learning!!!

K-medoids, K-means

Great learning; a lot of revision is needed to really deep dive and understand the fundamentals.

K-means
  • Sensitive to outliers (squared Euclidean distance gives greater weight to more distant points)
  • Can't handle categorical data
  • Works with Euclidean distance only
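A small sketch of the outlier problem, using made-up one-dimensional data. The starting centres and the choice of the Lloyd algorithm are assumptions made only to keep the run reproducible.

```r
# Hypothetical data: two tight groups plus one extreme value
x <- c(1, 2, 3, 10, 11, 12, 100)

# Fixed starting centres and Lloyd's algorithm keep the run deterministic
km <- kmeans(matrix(x, ncol = 1),
             centers   = matrix(c(2, 11), ncol = 1),
             algorithm = "Lloyd")

centres <- sort(as.numeric(km$centers))
# The outlier takes a whole cluster for itself, and the two real groups
# are merged around a centre that is not close to either of them
```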
K-Medoids
  • The centre of each cluster is restricted to an actual data point (the medoid)
  • Uses the same kind of summed cost function, but the distance need not be Euclidean
  • Use your own custom distance functions when the data mixes numerical and categorical variables
  • Example (25 languages, 24 columns, M/F/N - 2 columns) - Compute your own custom distance functions. A categorical variable with k levels needs k - 1 indicator columns, because the all-zero combination already encodes the remaining level
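Here is a rough base R sketch of k-medoids with a custom distance on mixed data. The data frame, the distance function (range-scaled numeric difference averaged with a 0/1 category mismatch) and the exhaustive search over medoid pairs are all assumptions for illustration. A real implementation would normally use a swap heuristic such as PAM rather than trying every pair.

```r
# Hypothetical mixed data: one numeric and one categorical column
people <- data.frame(
  age      = c(21, 22, 23, 45, 47, 46),
  language = c("R", "R", "Python", "SQL", "SQL", "SQL"),
  stringsAsFactors = FALSE
)

# Custom distance (an assumption of this sketch): range-scaled age difference
# and a 0/1 mismatch on the category, averaged over the two columns
age_range <- diff(range(people$age))
custom_dist <- function(i, j) {
  num_part <- abs(people$age[i] - people$age[j]) / age_range
  cat_part <- as.numeric(people$language[i] != people$language[j])
  (num_part + cat_part) / 2
}

n <- nrow(people)
d <- outer(seq_len(n), seq_len(n), custom_dist)  # full distance matrix

# Cost of a medoid set: each point contributes its distance to the nearest medoid
cost <- function(medoids) sum(apply(d[, medoids, drop = FALSE], 1, min))

# Exhaustive search over every pair of rows (fine only for tiny data)
candidates <- combn(n, 2)
best       <- candidates[, which.min(apply(candidates, 2, cost))]
clusters   <- apply(d[, best, drop = FALSE], 1, which.min)

medoid_rows <- people[best, ]  # medoids are actual rows of the data
```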
Distance measure for numerical variables
  • Euclidean based distance
  • Correlation based distance
  • Mahalanobis distance
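The three numerical measures can be sketched with base R functions. The matrix m and the vectors a and b are made-up values, and defining correlation-based distance as 1 minus the Pearson correlation is one common convention assumed here.

```r
# Hypothetical numeric data: 5 observations, 2 variables
m <- matrix(c(2,  4,
              3,  7,
              5,  8,
              6, 13,
              9, 15), ncol = 2, byrow = TRUE)

# Euclidean distance between every pair of rows
d_euclid <- as.matrix(dist(m, method = "euclidean"))

# Correlation-based distance between two profiles: 1 - Pearson correlation
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 9)
d_cor <- 1 - cor(a, b)

# Squared Mahalanobis distance of each row from the column means,
# which takes the covariance between the variables into account
d_mahal <- mahalanobis(m, center = colMeans(m), cov = cov(m))
```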
Distance measure for category variables
  • Matching coefficient and Jaccard's coefficient
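Both coefficients are easy to compute by hand for binary attributes. The two vectors below are made up for illustration.

```r
# Two hypothetical binary attribute vectors (1 = attribute present)
a <- c(1, 0, 1, 1, 0, 0)
b <- c(1, 1, 0, 1, 0, 0)

both_present <- sum(a == 1 & b == 1)  # 1-1 matches
both_absent  <- sum(a == 0 & b == 0)  # 0-0 matches
mismatches   <- sum(a != b)

# Simple matching coefficient: all matches count, including joint absences
matching <- (both_present + both_absent) / length(a)

# Jaccard's coefficient: joint absences are ignored
jaccard <- both_present / (both_present + mismatches)
```

Jaccard's coefficient is usually the better choice when a shared absence carries little information, for example with many sparse indicator columns.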

Happy Learning!!!

February 22, 2016

R and SQL Server

This post is an example of querying SQL Server and visualizing data using twitter. The package used is RODBC. A sample walk-through code snippet is provided.

Happy Learning!!!

February 19, 2016

January 02, 2016

R + Stats

The following course material is very useful for combining R and statistics, and it is great material for learning R. Captured below are notes from chapters 5, 6, 7 and 8.

What is a central limit theorem?

The central limit theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough. In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped.
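A small simulation can make this concrete. The sketch below draws repeated samples of size 30 from a skewed exponential population; the seed, the number of samples and the choice of distribution are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

n_samples   <- 5000
sample_size <- 30

# Means of repeated samples from a skewed population (exponential, mean 1, sd 1)
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

centre <- mean(sample_means)  # should be close to the population mean, 1
spread <- sd(sample_means)    # should be close to 1 / sqrt(30)

# The histogram looks roughly bell-shaped (only drawn in an interactive session)
if (interactive()) hist(sample_means, breaks = 40)
```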

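Binomial Probability - Each trial has only two mutually exclusive outcomes, often referred to as success and failure. Such a trial is also called a Bernoulli trial.
R commands - dbinom gives the probability of an exact number of successes, and pbinom gives the cumulative probability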
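For example, with 10 trials and a success probability of 0.5 (numbers chosen only for illustration):

```r
# P(exactly 3 successes in 10 trials with success probability 0.5)
p_exact <- dbinom(3, size = 10, prob = 0.5)

# P(at most 3 successes)
p_at_most <- pbinom(3, size = 10, prob = 0.5)

# pbinom is the running sum of dbinom
p_summed <- sum(dbinom(0:3, size = 10, prob = 0.5))
```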

Discrete Probability Distributions

R command - pnorm
Command Syntax - pnorm(x, mean = , sd = , lower.tail= )
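A short example of the syntax, assuming a normal distribution with mean 100 and standard deviation 10 (values chosen only for illustration):

```r
# P(X <= 110) for X ~ Normal(mean = 100, sd = 10)
p_lower <- pnorm(110, mean = 100, sd = 10, lower.tail = TRUE)

# P(X > 110), the upper tail
p_upper <- pnorm(110, mean = 100, sd = 10, lower.tail = FALSE)

# The same lower-tail probability via the standard normal, z = (110 - 100) / 10
p_standard <- pnorm(1)
```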

Two-Tailed Tests - A two-tailed test checks for the possibility of the relationship in both directions. With alpha = .05, this means that .025 is in each tail of the distribution of your test statistic.

One-Tailed Tests - A one-tailed test allots all of your alpha to testing the statistical significance in the one direction of interest. With alpha = .05, this means that .05 is in one tail of the distribution of your test statistic.

In the formulas below, z is the z statistic of the sample mean.

Alternative hypothesis has the > operator, right-tailed test
Right-Tailed Tests: P-value = pnorm(z, lower.tail=FALSE)

Alternative hypothesis has the < operator, left-tailed test
Left-Tailed Tests: P-value = pnorm(z, lower.tail=TRUE)

Alternative hypothesis has the ≠ operator, two-tailed (left and right) test
Two-Tailed Tests: P-value = 2 * pnorm(abs(z), lower.tail=FALSE)
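A worked sketch of the three cases for a one-sample z test. The sample mean, null mean, standard deviation and sample size are hypothetical numbers.

```r
# Hypothetical one-sample z test
x_bar <- 52   # sample mean
mu0   <- 50   # mean under the null hypothesis
sigma <- 6    # known population standard deviation
n     <- 36   # sample size

z <- (x_bar - mu0) / (sigma / sqrt(n))  # z statistic of the sample mean

p_right <- pnorm(z, lower.tail = FALSE)           # alternative uses >
p_left  <- pnorm(z, lower.tail = TRUE)            # alternative uses <
p_two   <- 2 * pnorm(abs(z), lower.tail = FALSE)  # alternative uses !=
```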

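pnorm(x, µ, σ) returns the probability that a normally distributed value is less than or equal to x, where
  • x is an observation from a normal distribution
  • µ is the mean
  • σ is the standard deviation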
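Computing P value from t value
pt(abs(t-value), df=degrees of freedom) gives the lower-tail cumulative probability; the two-tailed P-value is 2 * pt(abs(t-value), df=degrees of freedom, lower.tail=FALSE)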
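A short sketch with a hypothetical t value of 2.0 and 10 degrees of freedom. By default pt() returns the lower-tail cumulative probability, so a tail area needs lower.tail = FALSE.

```r
t_value <- 2.0  # hypothetical t statistic
df      <- 10   # hypothetical degrees of freedom

cum_prob <- pt(abs(t_value), df = df)                          # P(T <= |t|), not a P-value
p_right  <- pt(t_value, df = df, lower.tail = FALSE)           # right-tailed P-value
p_two    <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)  # two-tailed P-value
```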

Reference

Happy Learning!!!