"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 24, 2015

t-test and z-test

Problems to work out (Good Compiled List)


Z - Scores

Z-scores make it easy to compare scores from distributions that use different scales

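Formula #1 (z-score of a single value): z = (x - Mean) / SD

Formula #2 (z-score of a sample mean): z = (Sample Mean - Mean) / (SD / sqrt(n))

Formula #3 for raw score computation is defined by x = Mean + z * SD

Formula #4 for Standard Error: SE = SD / sqrt(n)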

Trying out problems in link 

Problem 2. Suppose X is a normal random variable with a mean of 120 and a standard deviation of 20. Determine the probability that X is greater than 135.

Mean = 120
SD = 20
Z score = (135-120)/20 = 0.75

z score from attached link

Find P(Z < 0.75) = 0.7734
P(X > 135) = P(Z > 0.75) = 1 - 0.7734 = 0.2266
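The table lookup can be cross-checked in R. This is a minimal sketch using base R's pnorm; the variable names are my own and not from the linked problem set.

```r
# Problem 2: X ~ Normal(mean = 120, sd = 20); find P(X > 135)
mu <- 120
sigma <- 20

z <- (135 - mu) / sigma                    # z-score of 135
p_greater <- pnorm(z, lower.tail = FALSE)  # area to the right of z

print(z)
print(round(p_greater, 4))
```

Using lower.tail = FALSE asks for the upper tail directly, which is the same as 1 - pnorm(z).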

Problem 4. If the test scores of 400 students are normally distributed with a mean of 100 and a standard deviation of 10, approximately how many students scored between 90 and 110?
Mean = 100
SD = 10

For x = 90, z = (90-100)/10 = -1
For x = 110, z = (110-100)/10 = 1
P(Z < -1) = 0.1587
P(Z < 1) = 0.8413
P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826

Multiply this proportion by 400: 0.6826 * 400 = 273.04. After rounding, we get 273 students.
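The same count can be reproduced in R. A small sketch, assuming base R's pnorm with the mean and SD passed directly so no manual z conversion is needed (variable names are illustrative):

```r
# Problem 4: scores ~ Normal(mean = 100, sd = 10), 400 students
p_between <- pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)
students <- round(400 * p_between)

print(round(p_between, 4))
print(students)
```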

Problem 16. A traffic study shows that the average number of occupants in a car is 1.5 and the standard deviation is .35. In a sample of 45 cars, find the probability that the mean number of occupants is greater than 1.6.

Mean = 1.5
SD = .35

Applying Formula #2

P(mean > 1.6) = 1 - P(mean < 1.6)
z for a sample mean of 1.6 = ((1.6-1.5)*sqrt(45)) / 0.35
          = 1.916

P(Z < 1.91) = 0.9719 (table lookup)
P(mean > 1.6) = P(Z > 1.91) = 1 - 0.9719 = 0.0281
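A sketch of the same sample-mean calculation in R, assuming base R's pnorm. R keeps the unrounded z, so the probability can differ slightly in the third or fourth decimal from a table lookup at two decimals. Variable names are my own.

```r
# Problem 16: population mean 1.5, SD 0.35, sample of 45 cars
mu <- 1.5
sigma <- 0.35
n <- 45

se <- sigma / sqrt(n)                      # standard error of the sample mean
z <- (1.6 - mu) / se                       # z-score of a sample mean of 1.6
p_greater <- pnorm(z, lower.tail = FALSE)  # P(sample mean > 1.6)

print(round(z, 3))
print(p_greater)
```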

Happy Learning!!!

November 21, 2015

chi-square test for homogeneity

The chi-square test for homogeneity determines whether several populations are homogeneous (have the same distribution) with respect to some characteristic.

This link was useful

I tried the problem provided in the link

Problem - Know how to compute the chi-square homogeneity test statistic.

Step 1 

Step 2

Step 3

1-pchisq(19,df=2) - R Command

Since the resulting p-value is less than 0.05, you reject the null hypothesis
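The R command above can be written a little more explicitly. A minimal sketch, assuming the test statistic of 19 with 2 degrees of freedom from the command above; the variable names are mine.

```r
# p-value for a chi-square statistic of 19 with 2 degrees of freedom
stat <- 19
df <- 2

p_value <- pchisq(stat, df = df, lower.tail = FALSE)  # same as 1 - pchisq(stat, df)
reject_h0 <- p_value < 0.05

print(p_value)
print(reject_h0)
```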

Happy Learning!!!

Chi Square Test for Independence

  • Uses a cross classification table to examine the nature of the relationship between these variables
  • Tables are sometimes referred to as contingency tables
  • Determines whether the variables are dependent on each other or not
  • H0: the null hypothesis assumes that there is no relationship between the two variables
  • Ha: the alternative hypothesis is that there is some relationship between the variables
The general formula for the degrees of freedom is (number of rows - 1) * (number of columns - 1).

In terms of independence and dependence these hypotheses could be stated
  • H0 : X and Y are independent
  • H1 : X and Y are dependent
Expected Frequency = ((row total)*(column total))/Total Population

I liked the example provided in link  

Problem - Test for a Relationship between Sex and Class

                         X (Sex)
Y (Social Class)     Male (M)   Female (F)   Total
Upper Middle (A)        33          29         62
Middle (B)             153         181        334
Working (C)            103          81        184
Lower (D)               16          14         30
Total                  305         305        610

Table 10.12: Social Class Cross Classified by Sex of Respondents

Expected Frequency = ((row total)*(column total))/Total Population

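If the significance (p-value) is greater than or equal to 0.05, you don't reject the null hypothesis. Here each expected frequency is half of its row total (31, 167, 92 and 15 per sex), which gives chi-square = 5.369 with (4-1)*(2-1) = 3 degrees of freedom. That is below the 0.05 critical value of 7.815, so the null hypothesis of independence is not rejected.

The results match the worked problem although the approach is different. The grand total is 610.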
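The whole independence test can be reproduced in R. This is a sketch that first applies the expected-frequency formula by hand and then cross-checks against base R's chisq.test; the matrix layout and variable names are my own.

```r
# Observed counts: social class (rows) by sex (columns)
counts <- matrix(c(33, 29,
                   153, 181,
                   103, 81,
                   16, 14),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(class = c("Upper Middle", "Middle", "Working", "Lower"),
                                 sex = c("Male", "Female")))

# Expected Frequency = (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

chi_sq <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check with the built-in test (no continuity correction)
builtin <- chisq.test(counts, correct = FALSE)

print(round(chi_sq, 3))
print(df)
print(p_value)
```

A p-value above 0.05 means the null hypothesis of independence is not rejected.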

Happy Learning!!!

Stats - Chi-Square Goodness of Fit Test

Purpose - Test whether the observed frequencies of a categorical variable follow a specified distribution

The chi-square test is defined for the hypothesis:
H0: The data follow a specified distribution
Ha: The data do not follow the specified distribution
This means that if the significance value is less than 0.05, you reject the null hypothesis; if significance is greater than or equal to 0.05, you don't reject the null hypothesis

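Formula is chi-square = Sum over i of (Oi - Ei)^2 / Ei, where Oi are the observed and Ei the expected frequencies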
I liked the example mentioned in notes

Problem - Testing an octahedral die to see if it is biased

Score 1 2 3 4 5 6 7 8
Frequency 7 10 11 9 12 10 14 7 (Observed)

Degrees of Freedom = Number of entries - 1. Here it is 8-1 = 7
Test the hypothesis H0: The die is fair
H1: The die is not fair
Significance level alpha = 0.05

Expected frequency under a uniform distribution is Ei = Sum of all observed frequencies / 8 (number of faces)
= 80/8 = 10

The expected values will be
Score 1 2 3 4 5 6 7 8
Frequency 10 10 10 10 10 10 10 10 (Expected)

To compute the statistic we need the values of (Oi-Ei) and ((Oi-Ei)*(Oi-Ei))/Ei for each pair of elements in the two arrays, and then the sum of the latter. Here that is (9+0+1+1+4+0+16+9)/10 = 4.0

Compute chisquare value (R Command)

The p-value is above the significance level (> 0.05). So we cannot reject the null hypothesis

Answer - There is no evidence that the die is biased
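The die example can be checked in R. A sketch that applies the goodness-of-fit formula by hand and then compares with base R's chisq.test, which assumes equal probabilities when given a single vector of counts. Variable names are my own.

```r
# Observed frequencies for faces 1..8 of the octahedral die
observed <- c(7, 10, 11, 9, 12, 10, 14, 7)
expected <- rep(sum(observed) / length(observed), length(observed))  # 10 per face

chi_sq <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Built-in goodness-of-fit test against a uniform distribution
builtin <- chisq.test(observed)

print(chi_sq)
print(df)
print(p_value)
```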

Happy Learning!!!

Good Read on Taylor Series

Two summary points
  • A Taylor Series is an expansion of a function into an infinite sum of terms, for example e^x = 1 + x + x^2/2! + x^3/3! + ...
  • A derivative gives you the slope of a function at any point
Detailed Notes in link
Taylor series Formula Compilation - link
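To see a Taylor series converge, here is a small sketch in R that sums the first few terms of the series for e^x and compares with the built-in exp. The function name taylor_exp is hypothetical, made up for this illustration.

```r
# Partial sum of the Taylor series of e^x around 0: sum of x^k / k!
taylor_exp <- function(x, n_terms) {
  k <- 0:(n_terms - 1)
  sum(x^k / factorial(k))
}

approx_5 <- taylor_exp(1, 5)    # 1 + 1 + 1/2 + 1/6 + 1/24
approx_12 <- taylor_exp(1, 12)

print(approx_5)
print(approx_12)
print(exp(1))
```

Adding more terms brings the partial sum closer to exp(1).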

Happy Learning!!!

November 08, 2015

K Means Clustering

I'm slowly making progress in Stats, with a lot of learning along the way. This post is from my class notes.

K-means clustering

  • Finding groups of objects similar to one another
  • Partitioning cluster approach
  • The mean (centroid) moves on every iteration (it usually converges within the first few iterations)
  • Classify a given data set into a chosen number of clusters
  • This does not fit well when clusters differ a lot in density (sparse vs dense clusters)

Great 5 Minute Video

Step 1 - "Figure out the centroid of each region"
Step 2 - "Select K data points randomly as the starting centres"
Step 3 - "Assign each data point to the nearest centre"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Steps 3 and 4 until the centroids stop moving"
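The steps can be tried out with base R's kmeans. This is only a sketch on made-up data: the two synthetic groups, the seed and the choice of K = 2 are all assumptions for illustration, not from the class notes.

```r
# Two well-separated synthetic groups of 50 points each (2 columns = 2 features)
set.seed(42)
group_a <- matrix(rnorm(100, mean = 0, sd = 1), ncol = 2)
group_b <- matrix(rnorm(100, mean = 10, sd = 1), ncol = 2)
pts <- rbind(group_a, group_b)

# K = 2; nstart = 10 repeats the random initialisation and keeps the best run
fit <- kmeans(pts, centers = 2, nstart = 10)

print(fit$centers)         # final centroids
print(table(fit$cluster))  # number of points assigned to each cluster
```

kmeans handles the assign and recalculate loop internally, so only the data and K need to be supplied.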

More Reads - K-Means Clustering

Happy Learning!!!

November 02, 2015

Quick Tip - Python Stemming Module Installation - Windows

Copy the scripts to package folder. Run the command easy_install.py specifying the package containing scripts.

Happy Learning!!!

October 31, 2015

5 minute quick learning - Naïve Bayes

Good 5 Minute Learning!!!

Conditional Probability - Easy Walkthrough

I take several iterations to understand and try out a concept. Going back and relearning after some time is interesting. This post is a quick explanation of Conditional Probability.

P(A) - Probability of Event A to occur
P(A/B) - Probability of A given that B has already occurred. This is referred to as conditional probability

Problem - Roll a fair die. Let A be the event of an odd outcome and B the event that the outcome is <= 3. What is the probability of A, and the probability of A given that B has already occurred?

A = Odd Outcomes = {1,3,5} = 3
B = Outcome <=3 = {1,2,3} = 3
Sample space = {1,2,3,4,5,6} = 6

Probability P(A) = |A| / |S| = 3/6 = 1/2

Probability P(A/B) = Probability of A given that B has already occurred

From B outcomes {1,2,3}, Possible A values are = {1,3}

p(A/B) = Events{1,3} / Events of B{1,2,3}
p(A/B)  = 2/3
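The die example is small enough to enumerate in R. A minimal sketch with base R set operations; the variable names are my own.

```r
# Fair die: sample space and the two events
S <- 1:6
A <- S[S %% 2 == 1]  # odd outcomes: 1, 3, 5
B <- S[S <= 3]       # outcomes <= 3: 1, 2, 3

p_A <- length(A) / length(S)
# Given B, only outcomes inside B are possible, so count A within B
p_A_given_B <- length(intersect(A, B)) / length(B)

print(p_A)
print(p_A_given_B)
```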

Bayes Theorem

P(CD) = P(C/D)P(D)

P(CD) = P(D/C)P(C)

Equating both the formulas

P(D/C)P(C) = P(C/D)P(D)

P(D/C) = (P(C/D)P(D)) / P(C)

P(C/D) = (P(D/C)P(C)) / P(D)
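Bayes Theorem can be checked numerically on a fair die. In this sketch I assume C is the event of an odd outcome and D the event that the outcome is <= 3; these assignments and the helper p are my own choices for illustration.

```r
# Fair die; C = odd outcome, D = outcome <= 3 (assumed for illustration)
S <- 1:6
C <- c(1, 3, 5)
D <- c(1, 2, 3)

p <- function(event) length(event) / length(S)  # probability of an event

p_D_given_C <- length(intersect(C, D)) / length(C)

# Bayes Theorem: P(C/D) = P(D/C) P(C) / P(D)
p_C_given_D_bayes <- p_D_given_C * p(C) / p(D)

# Direct count for comparison
p_C_given_D_direct <- length(intersect(C, D)) / length(D)

print(p_C_given_D_bayes)
print(p_C_given_D_direct)
```

Both routes give the same value, which is what the formula promises.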

Happy 2 Minute Quick Learning!!!

September 22, 2015

R Working on Normal Distributions & Binomial Distributions

Cumulative distribution function - The CDF gives the probability that the variable takes a value less than or equal to x, i.e. the accumulated probability up to x. The CDF is the integral of the PDF (Probability density function).

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora
  • “d” - returns the height of the probability density function
  • “p” - returns the cumulative distribution function
  • “q” - returns the inverse cumulative distribution function (quantiles)
  • “r” - returns randomly generated numbers
Reference - Link

R Examples
Mean - 500
Standard deviation - 100

A score of 700 is 2 standard deviations above the mean, so it is better than about 97% of other scores
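Here is a sketch of the four prefixes applied to the normal distribution in R. It assumes the 100 is the standard deviation, which is what the 97% figure implies; variable names are my own.

```r
score_mean <- 500
score_sd <- 100  # assumption: 100 is treated as the standard deviation

# "p": share of scores at or below 700
p_below <- pnorm(700, mean = score_mean, sd = score_sd)

# "q": inverse of "p", recovers the score for a given cumulative probability
cutoff <- qnorm(p_below, mean = score_mean, sd = score_sd)

# "d": height of the density curve at 700
height <- dnorm(700, mean = score_mean, sd = score_sd)

# "r": five random scores from the same distribution
set.seed(1)
draws <- rnorm(5, mean = score_mean, sd = score_sd)

print(round(p_below, 4))
print(cutoff)
```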

v <- c(-1,1,2)

Link Ref

Normal distribution is defined by the following probability density function, where μ is the population mean and σ^2 is the variance: f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - μ)^2 / (2σ^2))

Chi-squared Distribution - If X1,X2,…,Xm are m independent random variables having the standard normal distribution, then the quantity V = X1^2 + X2^2 + … + Xm^2 follows a Chi-Squared distribution with m degrees of freedom

Binomial Distribution - the binomial distribution is a discrete probability distribution. It describes the number of successes in n independent trials of an experiment.

Two Extra Parameters - number of trials and the probability of success for a single trial

Probability mass function (dbinom)
x <- seq(0,50,by=1)
y <- dbinom(x,50,0.2)

50 - Number of Trials
0.2 - Probability of success for each trial

Cumulative Distribution Function (pbinom)
x <- seq(0,50,by=1)
y <- pbinom(x,50,0.5)

50 - Number of Trials
0.5 - Probability of success for each trial

Random Number Generation (rbinom)
x <- seq(0,50,by=1)
y <- rbinom(length(x),50,0.5)
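A short sketch of how the binomial functions relate to each other, using 50 trials with success probability 0.2 as in the first example. Variable names are my own.

```r
n_trials <- 50
p_success <- 0.2
x <- 0:n_trials

pmf <- dbinom(x, size = n_trials, prob = p_success)  # P(X = x)
cdf <- pbinom(x, size = n_trials, prob = p_success)  # P(X <= x)

total_prob <- sum(pmf)         # probabilities over all outcomes add up to 1
mean_successes <- sum(x * pmf) # expected number of successes = n * p
median_successes <- qbinom(0.5, size = n_trials, prob = p_success)

print(total_prob)
print(mean_successes)
print(median_successes)
```

The cumulative values from pbinom are the running total of the dbinom values.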

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)

September 18, 2015

Maths - Basics

Good Read

Happy Learning!!!

Class Notes - Information Retrieval

Information Retrieval Notes
  • Document Corpus (Collection of links / indexed web structures)
Examples of Information Retrieval
  • When a user hums a song and we identify the song from it, that is classified as an IR problem
  • Multimedia IR (music, video, analysing music videos)
  • Photo Search (Visual IR)
Information Retrieval Application Areas
  • Text Information Retrieval
  • Web Search
  • Social Media Search
  • Micro Blogs
  • Twitter Blogs
Boolean Information Retrieval
  • Simplest model
  • Restricted Queries
  • queries are boolean expressions
Inverted Index
  • For each item we have a list
  • Like index of a book (Topic, Pagenumber), Close to glossary
  • Document, Tokenize the text
  • Inside document order tokens, heuristics to combine multiple tokens for index construction
  • Document Frequency - Number of documents in which a term appears
  • Ordering (Right to Left)
  • Proximity Search (Leveraging the context)
  • Encoding 
  • Normalization and keyword detection based on locale
  • Accents, patterns
  • Stemming (Chopping end of words to obtain root word)
  • Porter algorithm for stemming
  • Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
  • Skip pointers(Find elements common between both lists, increment until they match)
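To make the inverted index idea concrete, here is a toy sketch in base R. The three documents, the tokenize helper and the lower-casing-only normalization are all assumptions for illustration; there is no stemming or stopword removal here.

```r
# Tiny made-up corpus
docs <- c(d1 = "The normal distribution is symmetric",
          d2 = "Binomial distribution counts successes",
          d3 = "Normal scores use the z table")

# Lower-case, split on non-letters, keep each term once per document
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  unique(tokens[nzchar(tokens)])
}

doc_tokens <- lapply(docs, tokenize)
terms <- unlist(doc_tokens, use.names = FALSE)
doc_ids <- rep(names(doc_tokens), lengths(doc_tokens))

# Inverted index: for each term, the list of documents containing it
index <- split(doc_ids, terms)

# Boolean query "normal AND distribution" = intersection of the two lists
hits <- intersect(index[["normal"]], index[["distribution"]])

# Document frequency = length of each term's list
doc_freq <- lengths(index)

print(index[["normal"]])
print(hits)
print(doc_freq[["distribution"]])
```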
Happy Learning!!!

September 13, 2015

Central Limit Theorem

Normal Distribution

Standard Normal Distribution - Mean = 0, Variance = 1

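The distribution of a sum or mean approaches a normal distribution as the number of variables grows.

"As n increases, the distribution of sample mean approaches normal distribution"

Central Limit Theorem - The sum (or mean) of a large number of independent random variables with finite variance is approximately normally distributed, which is why many measurable "random" quantities in the real world look roughly normal.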
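The theorem can be seen in a quick simulation. This sketch draws samples from a skewed exponential distribution (population mean 1, SD 1) and looks at the sample means; the sample size of 30, the 5000 repetitions and the seed are arbitrary choices of mine.

```r
set.seed(1)
n <- 30  # size of each sample

# 5000 sample means, each from n exponential draws
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))

mean_of_means <- mean(sample_means)  # should be close to the population mean, 1
sd_of_means <- sd(sample_means)      # should be close to 1 / sqrt(n)

print(mean_of_means)
print(sd_of_means)
print(1 / sqrt(n))
```

Calling hist(sample_means) on the result shows a roughly bell-shaped histogram even though the underlying data are skewed.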

Good Link

"Sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger"

"Average of your sample means will be the population mean"

Happy Learning!!!

September 09, 2015

Linear Algebra Playlist

Linear Algebra and Calculus basics Playlist bookmarked based on reference from my colleague

Pauls Online Calculus Notes

MIT Linear Algebra Playlist

Linear Algebra


Linear Regression Basics

To understand Linear Regression, revising the basics of slope and correlation was useful.

Slope Revision

Finding the slope of a line from its graph: Slope of a line

Slope = Change in Y / Change in X
Slope is constant for a line

Simple Linear Regression

  • One explanatory variable - Simple Regression
  • More than one explanatory variable - Multiple Regression
  • X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
  • Good fitting line (Reasonable for predicting relationship) - Measure of Strength of Relationship (Correlation)
  • Correlation - A & B are observed at the same time and tend to vary together
  • Method of Least Squares to estimate B0 (intercept) and B1 (slope)
  • Residual = Observed - Predicted value (Above Line +ve, Below Line -ve)
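A sketch of simple linear regression in R on a small made-up dataset. It computes the least squares slope and intercept by hand and compares them with base R's lm; the data values and variable names are assumptions for illustration.

```r
# Made-up data: one explanatory variable x, one response y
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates by hand
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept

# Same fit with the built-in linear model
fit <- lm(y ~ x)
res <- residuals(fit)         # Observed - Predicted for each point

print(c(b0, b1))
print(coef(fit))
print(round(res, 3))
```

Positive residuals are points above the fitted line and negative ones are below it. With an intercept in the model they sum to zero.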
Reference Videos

September 07, 2015

Covariance and Correlation - Random Variable - Probability

This video was useful to understand the covariance and correlation relationship

Happy Learning!!!

Working with R - InterQuartile Range

Concept - IQR - InterQuartile Range

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile
  • Median - Arrange data from lowest to highest and take the middle value
  • On Even dataset - Average of the two middle numbers
  • On Odd dataset - Single number that is halfway into the set
Dataset - 5,6,12,13,15,18,22,50

Q2 = (13+15)/2 = 14 - Median of Data Value

Q1 = (6+12)/2 = 9 - Median Before Q2

Q3 = (18+22)/2 = 20 - Median After Q2

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset
  • Minimum Value - 5
  • Q1 - 9
  • Q2 - 14
  • Q3 - 20
  • Maximum Value - 50
This is the mathematical concept. This is used for finding outliers.

Outlier - A value much larger or smaller than the other values in the data set. IQR is obtained by subtracting the first quartile from the third quartile.

Finding Outliers
1. Any value < Q1-1.5(IQR) or > Q3+1.5(IQR) is an outlier
2. Any Value < 9-1.5(11) = 9-16.5 = -7.5
    Any Value > 20+1.5(11) = 20+16.5 = 36.5
3. Only 50 is above 36.5 and nothing is below -7.5, so 50 is the only outlier

This Video was useful to understand the concept before trying out in R

Computing using R

IQR between 25th percentile and the 75th

dataset <-c( 5,6,12,13,15,18,22,50 )
quantile(x=dataset, probs= c(.25,.75))

Sample Output

Outlier highlighted in circles
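One thing to watch when moving from the hand calculation to R: quantile() interpolates by default, so for this dataset it returns 10.5 and 19 rather than 9 and 20. The sketch below uses fivenum(), which takes the median of each half and so reproduces the hand-computed quartiles, and then applies the 1.5 * IQR rule. Variable names are my own.

```r
dataset <- c(5, 6, 12, 13, 15, 18, 22, 50)

# fivenum: minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(dataset)
q1 <- hinges[2]
q3 <- hinges[4]
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- dataset[dataset < lower_fence | dataset > upper_fence]

# Default quantile() output for comparison (interpolated quartiles)
default_quartiles <- quantile(dataset, probs = c(0.25, 0.75))

print(hinges)
print(c(lower_fence, upper_fence))
print(outliers)
print(default_quartiles)
```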

Happy Learning!!!

September 06, 2015

Class 3 - Statistics Notes

This was mostly on probability distribution functions. A couple of one-liners from the session

Conditional Probability Distribution - If the value of one random variable does not impact the distribution of another, then the random variables are independent

Variance - Average squared deviation of the values from the mean (how spread out the values are around the mean)

Covariance - If X and Y are two independent variables Covariance is zero
Correlation - Values between -1 and 1
Chebyshev Inequality - Bounds the probability in the tails for any random variable with finite variance: P(|X - Mean| >= k*SD) <= 1/k^2
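The Chebyshev bound on tail probability, P(|X - Mean| >= k*SD) <= 1/k^2, can be checked on simulated data. This sketch uses an exponential sample; the distribution, the seed and k = 2 are arbitrary choices of mine.

```r
set.seed(7)
x <- rexp(10000, rate = 1)  # skewed data, far from normal
k <- 2

# Share of values at least k standard deviations away from the mean
tail_share <- mean(abs(x - mean(x)) >= k * sd(x))
bound <- 1 / k^2            # Chebyshev upper bound

print(tail_share)
print(bound)
```

The observed tail share is usually well below the bound, because Chebyshev has to hold for every distribution and so is quite loose.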

Related Reads
Happy Learning!!!