## November 24, 2015

### Z-Scores

Z-scores make it easy to compare scores from distributions that use different scales

*Formula #1*

Formula #3 for raw score computation is defined by
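A minimal sketch of the two conversions, assuming Formula #1 and Formula #3 refer to the standard definitions z = (x - mean) / sd and x = mean + z * sd. The function names and the exam numbers are made up for illustration.

```python
def z_score(x, mean, sd):
    """Standardize a raw score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def raw_score(z, mean, sd):
    """Invert the z-score to recover the raw score."""
    return mean + z * sd


# Two hypothetical exams on different scales (illustrative numbers only).
z_a = z_score(82, mean=70, sd=8)       # exam A
z_b = z_score(640, mean=500, sd=100)   # exam B
print(z_a, z_b)                        # the larger z is the better relative score
print(raw_score(z_a, mean=70, sd=8))   # converts back to the raw score on exam A
```

Once both scores are on the z scale they can be compared directly, even though the raw scales differ.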

__Problem 2__. Suppose X is a normal random variable with a mean of 120 and a standard deviation of 20. Determine the probability that X is greater than 135.
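A quick numerical check of Problem 2, sketched with `statistics.NormalDist` from Python's standard library (Python 3.8 or later) in place of a printed z-table. The variable names are only illustrative.

```python
from statistics import NormalDist

# X ~ Normal(mean=120, sd=20); we want P(X > 135).
x_dist = NormalDist(mu=120, sigma=20)

z = (135 - 120) / 20             # z-score of 135
p_greater = 1 - x_dist.cdf(135)  # upper-tail probability

print(z, round(p_greater, 4))
```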

## November 21, 2015

### Chi-Square Test for Homogeneity

### Chi-Square Test for Independence

**Approach**

**The general formula for the degrees of freedom is (number of rows - 1) × (number of columns - 1).**

In terms of independence and dependence these hypotheses could be stated

**Expected Frequency = ((row total)*(column total))/Total Population**

**Problem - Test for a Relationship between Sex and Class**

**X (Sex)**

### Stats - Chi-Square Goodness of Fit Test

**Purpose** - Test association of variables in two-way tables

The chi-square test is defined for the hypothesis:

**H0:** The data follow a specified distribution

**Ha:** The data do not follow the specified distribution

This means that if the significance value is **less than 0.05, you reject the null hypothesis**; if significance is **greater than or equal to 0.05, you don't reject the null hypothesis**

Formula is

**Problem - Testing an octahedral die to see if it is biased**

**Degrees of Freedom = Number of entries - 1. Here it is 8-1 = 7**

**Expected frequency under a uniform distribution: Ei = Sum of all observed frequencies / 8 (Number of faces)**

**The expected values will be**

**For each element between both the arrays**

**1-pchisq(4,df=7)**

**0.7797774**

**Answer - The Die is Fair**

Happy Learning!!!

### Good Read on Taylor Series

**Two summary points**

Taylor series Formula Compilation - link

Happy Learning!!!
## November 19, 2015

### Good Formula Lists Compilations

## November 08, 2015

### K Means Clustering

I'm slowly moving in Stats with a lot of learning. This post is from my class notes

**K-means clustering**

__Great 5 Minute Video__

Step 1 - "Figure out the centroid of each region"

Step 2 - "Select K Data points randomly"

Step 3 - "Assign each data point to nearest centre"

Step 4 - "Recalculate the new centroids"

Step 5 - "Repeat Step 3,4"
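The steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the fixed seed, and the tiny 2-D data set are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # select k data points at random
    for _ in range(iterations):
        # Assign each data point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recalculate each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments stopped changing
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups (made-up data).
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On well-separated data like this the centroids settle within a few iterations, which matches the note that the mean moves only during the first few passes.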

**More Reads** - K-Means Clustering

Happy Learning!!!
## November 02, 2015

### Quick Tip - Python Stemming Module Installation - Windows

Copy the scripts to the package folder, then run easy_install.py, specifying the package that contains the scripts.

Happy Learning!!!
## October 31, 2015

### Conditional Probability - Easy Walkthrough

## September 22, 2015

### R Working on Normal Distributions & Binomial Distributions

**Cumulative distribution function** - The CDF gives the probability that the variable takes a value less than or equal to x, i.e. the accumulated probability up to x. The CDF is the integral of the PDF (probability density function).

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora

**R Examples**

**Normal distribution** is defined by the following probability density function, where μ is the population mean and σ² is the variance: f(x) = (1 / (σ·sqrt(2π))) · exp(-(x - μ)² / (2σ²))

**Chi-squared Distribution** - If X1,X2,…,Xm are m independent random variables having the standard normal distribution, then the quantity V = X1² + X2² + … + Xm² follows a Chi-Squared distribution with m degrees of freedom

**Binomial Distribution** - The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment, where each trial has only two outcomes (success or failure).

**Two Extra Parameters** - number of trials and the probability of success for a single trial

__Distribution function__

__Cumulative Distribution Function__

**Random Number Generation**

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)
## September 18, 2015

### Class Notes - Information Retrieval

**Information Retrieval Notes**

**Examples of Information Retrieval**

**Information Retrieval Application Areas**

**Boolean Information Retrieval**

**Inverted Index**

**Challenges**

Happy Learning!!!
## September 13, 2015

### Central Limit Theorem

**Normal Distribution**

**Standard Normal Distribution - Mean = 0, Variance = 1**

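**The distribution of the sample mean approaches a normal distribution as the number of variables grows.**

**"As n increases, the distribution of sample mean approaches normal distribution"**

**Central Limit Theorem** - The mean (or sum) of a large number of independent random variables with finite variance is approximately normally distributed, whatever the distribution of the individual variables. This is why many measurable "random" quantities in the real world look roughly normal.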

Good Link

**"Sampling distribution of the sample means approaches a normal distribution as the sample size gets larger"**

**"Average of your sample means will be the population mean"**
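A small simulation can make the two quoted statements concrete. The sketch below assumes a fair six-sided die as the population (mean 3.5, clearly not normal); the sample size of 30, the 2000 repetitions, and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of n rolls of a fair six-sided die."""
    return mean(rng.randint(1, 6) for _ in range(n))


# Draw many samples of size 30 and keep each sample's mean.
means = [sample_mean(30) for _ in range(2000)]

# The average of the sample means should sit close to the population mean (3.5),
# and their spread should be close to sigma / sqrt(n).
print(mean(means), pstdev(means))
```

Plotting a histogram of `means` would show the bell shape, even though a single die roll is uniformly distributed.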

Happy Learning!!!
## September 09, 2015

### Linear Algebra Playlist

### Linear Regression Basics

## September 07, 2015

### Working with R - InterQuartile Range

**Concept - IQR - InterQuartile Range**

**Dataset** - 5,6,12,13,15,18,22,50

**Q2 =** (13+15)/2 = 14 **- Median of Data Value**

**Q1 = **(6+12)/2 = 9 **- Median Before Q2**

**Q3 =** (18+22)/2 = 20 **- Median After Q2**

**This is the mathematical concept. This is used for finding outliers.**

**Outlier** - A value much larger or smaller than the other values in the data set. The IQR is obtained by subtracting the first quartile from the third quartile.

**Finding Outliers**
**1. Any value < Q1-1.5(IQR) or > Q3+1.5(IQR) **is an outlier
**Any Value > 20+1.5(11) = 20+16.5 = 36.5**
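The quartile and outlier rules for this dataset can be checked with a short Python sketch. It uses the median-of-halves method described in these notes; other quartile conventions (including R's default) can give slightly different Q1 and Q3 values.

```python
from statistics import median

data = sorted([5, 6, 12, 13, 15, 18, 22, 50])

half = len(data) // 2
q1 = median(data[:half])    # median of the lower half
q3 = median(data[-half:])   # median of the upper half
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr   # values below this are outliers
high_fence = q3 + 1.5 * iqr  # values above this are outliers
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q3, iqr, low_fence, high_fence, outliers)
```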

This Video was useful to understand the concept before trying out in R

## September 06, 2015

### Class 3 - Statistics Notes

Formula #4 for Standard Error

Trying out problems in link

Find P(Z < 0.75) = 0.7734

1 - 0.7734 = 0.2266

__Problem 4.__ If the test scores of 400 students are normally distributed with a mean of 100 and a standard deviation of 10, approximately how many students scored between 90 and 110?

Mean = 100

SD = 10

For x = 90, z = (90-100)/10 = -1

For x = 110, z = (110-100)/10 = 1

**For Z (< -1), **

= 0.1587

**For Z (<1), **

= 0.8413

= 0.8413-0.1587

= 0.6826

**Multiply this percentage by 400. After rounding, we get 273 students.**
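The same answer can be reproduced without a z-table. This is a sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=10)

# Share of students scoring between 90 and 110 (within one standard deviation).
share = scores.cdf(110) - scores.cdf(90)
students = round(400 * share)

print(round(share, 4), students)
```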

**Problem 16.** A traffic study shows that the average number of occupants in a car is 1.5 and the standard deviation is .35. In a sample of 45 cars, find the probability that the mean number of occupants is greater than 1.6.

Mean = 1.5

SD = .35

__Applying Formula #2__

P(mean > 1.6) = 1 - P(mean < 1.6)

Z(1.6) = ((1.6-1.5)*sqrt(45)) / 0.35

= 1.916

P(Z < 1.916) = 0.9719 (table value at z = 1.91)

P(Z > 1.916) = 1 - 0.9719 = 0.0281
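A sketch of the same calculation in Python, assuming the formula applied here is the z-score of a sample mean, z = (x̄ - μ) / (σ / sqrt(n)). Because it uses the exact z rather than a table entry rounded to two decimals, the tail probability comes out slightly smaller than the table-based figure.

```python
from math import sqrt
from statistics import NormalDist

mean, sd, n = 1.5, 0.35, 45

standard_error = sd / sqrt(n)       # standard deviation of the sample mean
z = (1.6 - mean) / standard_error   # z-score of a sample mean of 1.6
p = 1 - NormalDist().cdf(z)         # P(sample mean > 1.6)

print(round(z, 3), round(p, 4))
```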

Happy Learning!!!

The chi-square test for homogeneity is a test made to determine whether several populations are similar or equal or homogeneous in some characteristics

This link was useful

I tried the problem provided in the link

- Uses a cross classification table to examine the nature of the relationship between these variables
- Tables are sometimes referred to as contingency tables
- Determine variables are dependent on each other or not

- H0: chi square test for independence is conducted by assuming that there is **no relationship between the two variables**
- Ha: alternative hypothesis is that there is some relationship between the variables

I liked the example provided in link

| Y (Social Class) | Male (M) | Female (F) | Total |
| --- | --- | --- | --- |
| Upper Middle (A) | 33 | 29 | 62 |
| Middle (B) | 153 | 181 | 334 |
| Working (C) | 103 | 81 | 184 |
| Lower (D) | 16 | 14 | 30 |
| Total | 305 | 305 | 610 |

Table 10.12: Social Class Cross Classified by Sex of Respondents
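The expected frequencies and the chi-square statistic for Table 10.12 can be recomputed with a few lines of Python. This is a sketch of the textbook formula (row total × column total / grand total), not the R session used in these notes; the p-value would still come from `pchisq` or a chi-square table.

```python
# Observed counts per social class: (male, female), taken from Table 10.12.
observed = {
    "Upper Middle": (33, 29),
    "Middle": (153, 181),
    "Working": (103, 81),
    "Lower": (16, 14),
}

col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

chi_square = 0.0
for row in observed.values():
    row_total = sum(row)
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total
        chi_square += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)
print(round(chi_square, 4), df)
```

Running this is a useful cross-check on the statistic passed to `pchisq`.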

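Summing (O-E)²/E over the eight cells of the table, row by row, gives 0.2581 + 2.3473 + 2.6304 + 0.1333 = 5.3691

1-pchisq(5.3691,df=3)

0.1467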

Since the p-value is greater than 0.05, we do not reject the null hypothesis.

The result matches the problem, although the approach is different. The grand total is 610.

Happy Learning!!!

I liked the example mentioned in notes

| Score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frequency (Observed) | 7 | 10 | 11 | 9 | 12 | 10 | 14 | 7 |

Test the hypothesis H0 - The Die is Fair

H1: Die is not fair

Significance level alpha = 0.05

= 80/8 = 10

| Score | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Frequency (Expected) | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |

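To compute the chi-square score, find (Oi - Ei) and ((Oi - Ei)*(Oi - Ei)) / Ei for each score, then sum the second quantity over all eight scores.

Compute the p-value from the chi-square value (R command)

The p-value (0.78) is greater than the 0.05 significance level, so we cannot reject the null hypothesis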
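The chi-square value for the die can be reproduced in a few lines of Python. This sketch computes only the statistic and the degrees of freedom; the p-value still comes from `pchisq` in R or a chi-square table.

```python
# Observed frequencies for scores 1..8 on the octahedral die.
observed = [7, 10, 11, 9, 12, 10, 14, 7]

# Under H0 (fair die) every face is equally likely.
expected = sum(observed) / len(observed)

# Sum of (Oi - Ei)^2 / Ei over all faces.
chi_square = sum((o - expected) ** 2 / expected for o in observed)
df = len(observed) - 1

print(expected, chi_square, df)
```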

- A Taylor Series is an expansion of a function into an infinite sum of terms, like these ones
- A derivative gives you the slope of a function at any point

Good Formula Lists Compilations

Happy Learning!!!

- Integrals
- Stats
- Mathematics - Rotations
- Partial Derivatives
- Differential Equations
- Laplace Transform
- Elementary Differential and Integral Calculus
- More Reads

- Finding groups of object similar to one another
- Partitioning cluster approach
- Mean moves every time (Within first few iterations it will converge)
- Classify a given data set through a certain number of clusters
- This does not fit well for Sparse / Dense clusters

I take several iterations to understand and try out a concept. Going back and relearning after some time is interesting. **This post is a quick explanation of Conditional Probability.**

P(A) - Probability of event A occurring

P(A/B) - Probability of A given that **B has already occurred.** This is referred to as conditional probability

**Problem **- Roll a fair die. Let A be event of odd outcomes. B be event where outcome <=3. What is probability of A and Probability of A given B has already occurred

A = Odd Outcomes = {1,3,5} = 3

B = Outcome <=3 = {1,2,3} = 3

Sample space = {1,2,3,4,5,6} = 6

Probability P(A) = |A| / |S| = 3/6 = 1/2

**Probability P(A/B) = Probability of A given that B has already occurred**

From B outcomes {1,2,3}, Possible A values are = {1,3}

p(A/B) = Events{1,3} / Events of B{1,2,3}

**p(A/B) = 2/3**
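The die example can be verified by enumerating the sample space. This is a small sketch using exact fractions; the helper `prob` is introduced only for this illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 1}  # odd outcomes
B = {x for x in sample_space if x <= 3}      # outcomes <= 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


p_a = prob(A)
p_a_given_b = prob(A & B) / prob(B)  # P(A and B) / P(B)

print(p_a, p_a_given_b)
```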

**Bayes Theorem**

P(CD) = P(C/D)P(D)

P(CD) = P(D/C)P(C)

**Equating both the formulas**

P(D/C)P(C) = P(C/D)P(D)

**P(D/C) = (P(C/D)P(D)) / P(C)**

**P(C/D) = (P(D/C)P(C)) / P(D)**
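As a sanity check, Bayes' theorem can be tested on a fair die. In this sketch C is taken to be "odd outcome" and D to be "outcome <= 3"; that choice of events is an assumption made for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
C = {1, 3, 5}  # odd outcomes
D = {1, 2, 3}  # outcomes <= 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


p_c_given_d = prob(C & D) / prob(D)          # direct definition of P(C/D)
p_d_given_c_direct = prob(C & D) / prob(C)   # direct definition of P(D/C)
p_d_given_c_bayes = p_c_given_d * prob(D) / prob(C)  # Bayes' theorem

print(p_d_given_c_direct, p_d_given_c_bayes)
```

Both routes give the same value, which is the equality the derivation relies on.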

Happy 2 Minute Quick Learning!!!

- “d” - returns the height of the probability density function
- “p” - returns the cumulative distribution function
- “q” - returns the inverse cumulative distribution function (quantiles)
- “r” - returns randomly generated numbers

pnorm(700,500,100)

Mean - 500

Standard deviation - 100

This score 700 is better than 97% of other scores

v <- c(-1,1,2)

dnorm(v)

plot(v,dnorm(v))

plot(v,pnorm(v))

dnorm(0,mean(v),sd(v))

Link Ref

x <- seq(0,50,by=1)

y <- dbinom(x,50,0.2)

plot(x,y)

50 - Number of Trials

0.2 - Probability of success for each trial

x <- seq(0,50,by=1)

y <- pbinom(x,50,0.5)

plot(x,y)

50 - Number of Trials

0.5 - Probability of success for each trial

x <- seq(0,50,by=1)

y <- rbinom(x,50,0.5)

plot(x,y)

- Document Corpus (Collection of links / indexed web structures)

- When a user hums a song and we identify the song from it, that is classified as an IR problem
- Multimedia IR (music, video, analysing music videos)
- Photo Search (Visual IR)

- Text Information Retrieval
- Web Search
- Social Media Search
- Micro Blogs
- Twitter Blogs

- Simplest model
- Restricted Queries
- queries are boolean expressions

- For each item we have a list
- Like index of a book (Topic, Pagenumber), Close to glossary
- Document, Tokenize the text
- Inside document order tokens, heuristics to combine multiple tokens for index construction
- Document Frequency - Number of documents in which the term appears
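The inverted index and Boolean retrieval notes can be illustrated with a toy example. This is a minimal sketch: the three documents, the whitespace tokenizer, and the helper name `boolean_and` are assumptions made for the illustration, not part of the class material.

```python
from collections import defaultdict

# A tiny made-up corpus: document id -> text.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Inverted index: term -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():  # naive tokenization and lower-casing
        index[token].add(doc_id)


def boolean_and(*terms):
    """Ids of documents that contain every query term (Boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Document frequency: in how many documents each term appears.
document_frequency = {term: len(ids) for term, ids in index.items()}

print(boolean_and("home", "july"), document_frequency["in"])
```

A real index would add the steps listed under Challenges, such as stemming, stopword removal, and normalization, before building the postings.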

- Ordering (Right to Left)
- Proximity Search (Leveraging the context)
- Encoding
- Normalization and keyword detection based on locale
- Accents, patterns
- Stemming (Chopping end of words to obtain root word)
- Porter algorithm for stemming
- Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
- Skip pointers(Find elements common between both lists, increment until they match)

Good Link

Linear Algebra and Calculus basics Playlist bookmarked based on reference from my colleague

**Calculus**

Pauls Online Calculus Notes

**Linear Algebra**

MIT Linear Algebra Playlist

To understand linear regression, revising the basics of slope and correlation was useful.

Slope Revision

**Finding the slope of a line from its graph**: Slope of a line

Slope = Change in Y / Change in X

Slope is constant for a line

**Simple Linear Regression**

**Reference Videos**

- One Explanatory Variable - Simple Regression
- More than one Explanatory Variable - Multiple Regression
- X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
- Good fitting line (Reasonable for predicting relationship) - Measure of Strength of Relationship (Correlation)
- Correlation - A & B are observed at the same time
- Method of Least Squares to estimate B0 and B1
- Residual = Observed - Predicted value (Above Line +ve, Below Line -ve)
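The least squares estimates of B0 and B1 and the residuals can be sketched in plain Python. The data points are made up, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope: change in Y per unit change in X
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


xs = [1, 2, 3, 4, 5]  # explanatory variable (made-up data)
ys = [2, 4, 5, 4, 5]  # response variable (made-up data)

b0, b1 = fit_line(xs, ys)

# Residual = observed - predicted: positive above the line, negative below it.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, residuals)
```

The residuals of a least squares line sum to zero (up to floating point error), which is a handy check on the fit.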

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile

**Median** - Arrange data from lowest to highest

- **On Even dataset** - Average of the two middle numbers
- **On Odd dataset** - Single number that is halfway into the set

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset

- Minimum Value - 5
- Q1 - 9
- Q2 - 14
- Q3 - 20
- Maximum Value - 50

2. **Any Value < (9-1.5(11)) = -7.5**

This was mostly on probability distribution functions. Couple of one liners from session

**Conditional Probability Distribution** - If the value of one random variable does not affect the other, the random variables are independent

**Variance** - Average squared deviation of the values from the mean (how spread out the values are around the mean)

**Covariance** - If X and Y are two independent variables, their covariance is zero

**Correlation** - Values lie between -1 and 1
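A short sketch of how variance, covariance, and correlation relate, using population formulas (dividing by n). The data and helper names are made up for the illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """Population variance: covariance of a variable with itself."""
    return covariance(xs, xs)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; always in [-1, 1]."""
    return covariance(xs, ys) / sqrt(variance(xs) * variance(ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(variance(xs), covariance(xs, ys), correlation(xs, ys))
```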

**Correlation**

**Probability "Chebyshev Inequality"**

**Chebyshev Inequality** - Bounds the probability in the tail for any random variable with finite mean and variance (derived by integrating over the tail region only): P(|X - μ| >= kσ) <= 1/k².

**Related Reads**

Happy Learning!!!

