"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 31, 2015

R and Datascience


I found the datascienceplus site very interesting.

The author has categorized R topics into
  • Data Loading
  • Data Management
  • Visualization
  • Stats
This really helps to align R learning accordingly. I am trying to follow the same pattern for my own R learning.

Happy Learning and Happy New Year 2016!!!

December 28, 2015

Information Retrieval Notes

Token - A sequence of characters treated as a unit. Tokenization chops text into tokens, possibly throwing away certain characters such as punctuation
Type - Equivalence class of tokens
Term - A type that is included in the IR dictionary
Term Frequency (tf) - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df) - Number of documents in the collection in which term t appears
Inverse Document Frequency (IDF) - log(N / df), where N is the total number of documents in the collection and df is the number of documents that contain term t
Stemming - Crude heuristics that chop the ends of words. Collapses derivationally related words. Stemming increases recall because morphological variations of a word are collapsed into a single token, giving a higher chance of retrieval
Lemmatization - Returns the base or dictionary form of a word. Collapses the different inflectional forms of a word.
Skip Pointers - For a postings list of length N, place sqrt(N) evenly spaced skip pointers
Positional Index - Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the set of documents (for example file names) that contain it
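
A minimal sketch of the term frequency, log frequency and IDF definitions above. The three tiny documents are made up for illustration, the whitespace tokenizer is a stand-in for a real one, and base-10 logarithms are an assumption (the notes only say "log").

```python
import math

# Toy collection, made up for illustration only.
docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def term_frequency(term, tokens):
    """Number of times the term appears in one document."""
    return tokens.count(term)

def log_frequency(tf):
    """1 + log10(tf) if tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def document_frequency(term, tokenized_docs):
    """Number of documents in the collection that contain the term."""
    return sum(1 for tokens in tokenized_docs.values() if term in tokens)

def inverse_document_frequency(term, tokenized_docs):
    """log10(N / df); returns 0 for a term that appears nowhere."""
    n = len(tokenized_docs)
    df = document_frequency(term, tokenized_docs)
    return math.log10(n / df) if df > 0 else 0.0

tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
idf_new = inverse_document_frequency("new", tokenized)          # log10(3/2)
idf_angeles = inverse_document_frequency("angeles", tokenized)  # log10(3/1)
print(idf_new, idf_angeles)
```

The rarer term ("angeles") gets the larger IDF, which is the point of the weighting.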

Boolean Retrieval (AND, OR, NOT)
  • Easy to Implement
  • Computationally efficient
  • Expressiveness and Clarity
Cons of Boolean Retrieval
  • No Ranking
  • No Weighting
Discounted Cumulative Gain (DCG)
  • Highly relevant docs are more useful when they appear earlier in search results list
  • Highly relevant docs are more useful than marginally relevant docs
DCG = Sum over result positions i of (2 power relevance(i) - 1) / log2(i+1)
NDCG = DCG / IDCG
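
A small sketch of DCG and NDCG, assuming the gain form (2^relevance - 1) with positions starting at 1, and taking IDCG as the DCG of the same relevance grades sorted in the ideal (descending) order. The graded relevance list is hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain; position i starts at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering (IDCG)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of the top 5 results (3 = highly relevant).
ranking = [3, 2, 3, 0, 1]
print(dcg(ranking), ndcg(ranking))
```

A ranking that already lists the most relevant documents first has NDCG equal to 1.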





HITS - Hyperlink induced Topic Search
  • Authorities - Direct answer to information need. Homepage of microsoft.com
  • Hub - Good Links to pages answering the information
  • Wikipedia good example for both Hub & Authority









Happy Learning!!!

December 24, 2015

T-Test

T-Test

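- Developed in 1908 by William Sealy Gosset, who published under the pseudonym "Student"
- Hence the t-test is also referred to as Student's t-test
- Mu, Sigma indicate population parameters (mean and standard deviation)
- X-bar, S represent the mean and standard deviation of the sample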




Hypothesis Tests in R



One Sample T-Test

Function - t.test example in R
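
As a rough companion to R's t.test, here is a sketch of the one-sample t statistic, t = (X-bar - Mu0) / (S / sqrt(n)), using only the Python standard library. The sample values and the hypothesized mean mu0 are made up, and the p-value lookup is left out (R's t.test reports it directly).

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Made-up measurements, tested against a hypothesized mean of 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.5]
t_stat, df = one_sample_t(sample, mu0=5.0)
print(t_stat, df)
```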

Happy Learning!!!

December 23, 2015

Hypothesis Testing Basics


After the exams I understood my improvement areas in terms of learning. These are the crucial topics

- P test using R Programming
- P test using Python Programming
- Hypothesis test using R Programming
- Hypothesis test using Python Programming

I glanced through a couple of sites and am bookmarking some of the pointers

Normal Distribution Properties




Key Pointers
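- Normal distribution is unimodal and symmetric
- Mean (Mu)
- Standard Deviation (Sigma)
- 99.7% of values lie within 3 Sigma of the mean
- 95% of values lie within 2 Sigma of the mean
- |Z| > 2 is considered unusual
- pnorm gives the percentile (cumulative probability) of an observation
- qnorm gives the quantile or cutoff value for a given probability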
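
The pointers above can be checked with a short sketch. It assumes Python 3.8+, where statistics.NormalDist offers cdf (similar in spirit to R's pnorm) and inv_cdf (similar to qnorm).

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of values within 2 and 3 standard deviations of the mean.
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)
within_3 = std_normal.cdf(3) - std_normal.cdf(-3)

# Cutoff below which 95% of the distribution lies (qnorm-style lookup).
cutoff_95 = std_normal.inv_cdf(0.95)

print(round(within_2, 4), round(within_3, 4), round(cutoff_95, 3))
```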







Key Pointers 
- Creating Null and Alternate Hypothesis conditions
- Identifying sample size, standard error, population mean, standard deviation from input question
- Computing P value






Happy Learning!!!

November 24, 2015

t-test and z-test





Problems to workout (Good Compiled List)

References
Link1
Link2

Z - Scores

Z-scores make it easy to compare scores from distributions that use different scales

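Formula #1 (z-score of a single observation)
z = (x - Mu) / Sigma

Formula #2 (z-score of a sample mean)
z = (X-bar - Mu) / (Sigma / sqrt(n))

Formula #3 for raw score computation is defined by
x = Mu + z * Sigma

Formula #4 for Standard Error
SE = Sigma / sqrt(n)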

Trying out problems in link 

Problem 2. Suppose X is a normal random variable with a mean of 120 and a standard deviation of 20. Determine the probability that X is greater than 135.

Mean = 120
SD = 20
Z score = (135-120)/20 = 0.75

z score from attached link



Find P(Z < 0.75) = 0.7734
P(X > 135) = 1 - 0.7734 = 0.2266

Problem 4. If the test scores of 400 students are normally distributed with a mean of 100 and a standard deviation of 10, approximately how many students scored between 90 and 110?
Mean = 100
SD = 10

For x = 90, z = (90-100)/10 = -1
For x = 110, z = (110-100)/10 = 1
P(Z < -1) = 0.1587
P(Z < 1) = 0.8413
P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826

Multiply this proportion by 400: 0.6826 * 400 = 273.04. After rounding, we get 273 students.

Problem 16. A traffic study shows that the average number of occupants in a car is 1.5 and the standard deviation is .35. In a sample of 45 cars, find the probability that the mean number of occupants is greater than 1.6.

Mean = 1.5
SD = .35

Applying Formula #2

P(mean > 1.6) = 1 - P(mean < 1.6)
Z = ((1.6-1.5)*sqrt(45)) / 0.35
  = 1.916

P(Z < 1.91) = 0.9719 (from the z table)
P(mean > 1.6) = 1 - 0.9719 = 0.0281
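
The three problems can be re-checked with a short sketch, assuming Python 3.8+ for statistics.NormalDist. It evaluates the normal CDF directly instead of reading a printed z table, so the last digits can differ slightly from table-based answers.

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Problem 2: P(X > 135) with mean 120 and standard deviation 20.
p2 = 1 - z.cdf((135 - 120) / 20)

# Problem 4: share of scores between 90 and 110, times 400 students.
share = z.cdf((110 - 100) / 10) - z.cdf((90 - 100) / 10)
students = round(400 * share)

# Problem 16: z-score of a sample mean (n = 45), then the upper tail.
z16 = (1.6 - 1.5) / (0.35 / math.sqrt(45))
p16 = 1 - z.cdf(z16)

print(round(p2, 4), students, round(z16, 3), round(p16, 4))
```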

Happy Learning!!!


November 21, 2015

chi-square test for homogeneity

The chi-square test for homogeneity is a test made to determine whether several populations are similar or equal or homogeneous in some characteristics

This link was useful

I tried the problem provided in the link

Problem - Know how to compute the chi-square homogeneity test statistic.

Step 1 


Step 2



Step 3



1-pchisq(19,df=2) - R Command
7.485183e-05

Since the p-value is less than 0.05, you reject the null hypothesis

Happy Learning!!!

Chi Square Test for Independence

  • Uses a cross classification table to examine the nature of the relationship between these variables
  • Tables are sometimes referred to as contingency tables
  • Determines whether the variables are dependent on each other or not
Approach
  • H0: chi square test for independence is conducted by assuming that there is no relationship between the two variables
  • Ha: alternative hypothesis is that there is some relationship between the variables
The general formula for the degrees of freedom is the number of rows minus one, times the number of columns minus 1.

In terms of independence and dependence these hypotheses could be stated
  • H0 : X and Y are independent
  • H1 : X and Y are dependent
Expected Frequency = ((row total)*(column total))/Total Population

I liked the example provided in link  

Problem - Test for a Relationship between Sex and Class

Y (Social Class)    X (Sex)
                    Male (M)   Female (F)   Total
Upper Middle (A)       33         29          62
Middle (B)            153        181         334
Working (C)           103         81         184
Lower (D)              16         14          30
Total                 305        305         610

Table 10.12: Social Class Cross Classified by Sex of Respondents

Expected Frequency = ((row total)*(column total))/Total Population



1-pchisq(5.3691,df=3)
 0.1467 (approximately)
Since the p-value is greater than 0.05, you don't reject the null hypothesis

The conclusion matches the one in the linked problem, although the approach is different. The grand total of the table is 610.
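
A sketch of the expected-frequency and chi-square computation for the Sex by Social Class table, using only the standard library. The helper chi_square_sf is my own stand-in for R's 1 - pchisq(x, df); it relies on the closed-form upper tail for integer degrees of freedom and is an assumption of this example, not a library function.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# Observed counts: rows are social classes A-D, columns are Male, Female.
observed = [[33, 29], [153, 181], [103, 81], [16, 14]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = (row total * column total) / total population.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = chi_square_sf(chi_square, df)
print(chi_square, df, p_value)
```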

Happy Learning!!!

Stats - Chi-Square Goodness of Fit Test

Purpose - Test whether the observed frequencies of a categorical variable follow a specified distribution

The chi-square test is defined for the hypothesis:
H0: The data follow a specified distribution
Ha: The data do not follow the specified distribution
This means that if the significance value is less than 0.05, you reject the null hypothesis; if significance is greater than or equal to 0.05, you don't reject the null hypothesis

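Formula is: Chi-square = Sum over all categories i of (Oi - Ei)^2 / Ei, where Oi is the observed and Ei the expected frequency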
I liked the example mentioned in notes

Problem - Testing an octahedral die to see if it is biased

Score 1 2 3 4 5 6 7 8
Frequency 7 10 11 9 12 10 14 7 (Observed)

Degrees of Freedom = Number of entries - 1. Here it is 8-1 = 7
Test the hypothesis H0: The die is fair
H1: The die is not fair
Significance level alpha = 0.05

Expected frequency under a uniform distribution is Ei = Sum of all observed frequencies / 8 (number of faces)
= 80/8 = 10

The expected values will be
Score 1 2 3 4 5 6 7 8
Frequency 10 10 10 10 10 10 10 10 (Expected)

To compute the statistic we need (Oi-Ei) and ((Oi-Ei)*(Oi-Ei))/Ei for each pair of elements of the two arrays

Chi-square = (9+0+1+1+4+0+16+9)/10 = 4

Compute the p-value from the chi-square value (R Command)
1-pchisq(4,df=7)
0.7797774

The p-value is above the significance level of 0.05, so we cannot reject the null hypothesis

Answer - There is no evidence that the die is biased, so we treat the die as fair
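
The die example can be reproduced with a short sketch. The helper odd_df_chi_square_sf is my own stand-in for R's 1 - pchisq(x, df) and only handles odd degrees of freedom (enough for df = 7 here); it is an assumption of this example, not a library call.

```python
import math

def odd_df_chi_square_sf(x, df):
    """Upper-tail probability P(X > x) for chi-square with odd integer df."""
    if df % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    total = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

observed = [7, 10, 11, 9, 12, 10, 14, 7]

# A fair die spreads the 80 rolls evenly over the 8 faces.
expected_each = sum(observed) / len(observed)

chi_square = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = len(observed) - 1
p_value = odd_df_chi_square_sf(chi_square, df)
print(chi_square, df, p_value)
```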

Happy Learning!!!

Good Read on Taylor Series

Two summary points
  • A Taylor Series is an expansion of a function into an infinite sum of terms, for example e^x = 1 + x + x^2/2! + x^3/3! + ...
  • A derivative gives you the slope of a function at any point
Detailed Notes in link
Taylor series Formula Compilation - link

Happy Learning!!!

November 08, 2015

K Means Clustering


I'm slowly moving in Stats with a lot of learning. This post is from my class notes

K-means clustering

  • Finding groups of objects similar to one another
  • Partitioning clustering approach
  • Means (centroids) move on every iteration (usually converging within the first few iterations)
  • Classify a given data set into a chosen number of clusters (K)
  • Does not fit well when clusters differ a lot in density (sparse vs dense clusters)

Great 5 Minute Video



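Step 1 - "Choose K, the number of clusters (each region is represented by its centroid)"
Step 2 - "Select K data points randomly as the initial centroids"
Step 3 - "Assign each data point to the nearest centroid"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Steps 3 and 4 until the centroids stop changing"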
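
The steps can be sketched in plain Python as below. The function name k_means, its parameters and the six 2-D points are all made up for illustration; a real project would more likely use a library implementation. Random starting points can lead to a poor local optimum, so the sketch also accepts explicit initial centroids.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    # Step 2: pick K data points at random unless starting centroids are given.
    rng = random.Random(seed)
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Step 4: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, labels

# Two made-up groups of points, one near (1, 1) and one near (9, 9).
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
centroids, labels = k_means(points, k=2, initial_centroids=[(1, 1), (9, 9)])
print(centroids, labels)
```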

More Reads - K-Means Clustering

DTW  - Dynamic Time Warping Algorithm. DTW - allowing similar shapes to match even if they are out of phase in the time axis

Ref - Link

Happy Learning!!!

November 02, 2015

Quick Tip - Python Stemming Module Installation - Windows


Copy the scripts to package folder. Run the command easy_install.py specifying the package containing scripts.

Happy Learning!!!

October 31, 2015

5 minute quick learning - Naïve Bayes



Good 5 Minute Learning!!!

Conditional Probability - Easy Walkthrough

It takes me several iterations to understand and try out a concept. Going back and relearning after some time is interesting. This post is a quick explanation of Conditional Probability.

P(A) - Probability of Event A to occur
P(A/B) - Probability of A given that B has already occurred. - This we refer as conditional probability

Problem - Roll a fair die. Let A be event of odd outcomes. B be event where outcome <=3. What is probability of A and Probability of A given B has already occurred

A = Odd Outcomes = {1,3,5}, |A| = 3
B = Outcome <=3 = {1,2,3}, |B| = 3
Sample space S = {1,2,3,4,5,6}, |S| = 6

Probability P(A) = |A| / |S| = 3/6 = 1/2

Probability P(A/B) = Probability of A given that B has already occurred

From B outcomes {1,2,3}, Possible A values are = {1,3}

p(A/B) = Events{1,3} / Events of B{1,2,3}
p(A/B)  = 2/3
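
The same die problem can be checked by enumerating the sample space. This is a small sketch using exact fractions; the helper names prob and cond_prob are made up here.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {x for x in sample_space if x % 2 == 1}  # odd outcomes
B = {x for x in sample_space if x <= 3}      # outcomes <= 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a given b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)

print(prob(A), cond_prob(A, B))
```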

Bayes Theorem

P(CD) = P(C/D)P(D)

P(CD) = P(D/C)P(C)

Equating both the formulas

P(D/C)P(C) = P(C/D)P(D)

P(D/C) = (P(C/D)P(D)) / P(C)

P(C/D) = (P(D/C)P(C)) / P(D)
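
A quick numerical check of the rearranged formula on a fair die. The events C (odd outcome) and D (outcome <= 2) are hypothetical choices for this sketch; the direct count and the Bayes formula should agree.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
C = {1, 3, 5}  # hypothetical event: odd outcome
D = {1, 2}     # hypothetical event: outcome <= 2

def prob(event):
    return Fraction(len(event), len(sample_space))

p_c_given_d = prob(C & D) / prob(D)             # direct definition
p_d_given_c_direct = prob(C & D) / prob(C)      # direct definition
p_d_given_c_bayes = p_c_given_d * prob(D) / prob(C)  # P(D/C) = P(C/D)P(D) / P(C)

print(p_d_given_c_direct, p_d_given_c_bayes)
```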

Happy 2 Minute Quick Learning!!!

September 22, 2015

R Working on Normal Distributions & Binomial Distributions


Cumulative distribution function - The CDF gives the probability that the variable is less than or equal to a given value, i.e. the accumulated probability up to that value. The CDF is the integral of the PDF (Probability density function).

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora
  • “d” - returns the height of the probability density function
  • “p” - returns the cumulative distribution function
  • “q” - returns the inverse cumulative distribution function (quantiles)
  • “r” - returns randomly generated numbers
Reference - Link

R Examples
pnorm(700,500,100)
Mean - 500
Standard Deviation - 100

This score of 700 is better than about 97.7% of other scores

v <- c(-1,1,2)
dnorm(v)
plot(v,dnorm(v))
plot(v,pnorm(v))
dnorm(0,mean(v),sd(v))

Link Ref

Normal distribution is defined by the following probability density function, where μ is the population mean and σ^2 is the variance.
f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - μ)^2 / (2 * σ^2))

Chi-squared Distribution - If X1,X2,…,Xm are m independent random variables having the standard normal distribution, then the following quantity follows a Chi-Squared distribution with m degrees of freedom
V = X1^2 + X2^2 + … + Xm^2

Binomial Distribution - The binomial distribution is a discrete probability distribution. It describes the number of successes in n independent trials of an experiment.

Two Extra Parameters - number of trials and the probability of success for a single trial

Distribution function
x <- seq(0,50,by=1)
y <- dbinom(x,50,0.2)
plot(x,y)

50 - Number of Trials
0.2 - Probability of success for each trial

Cumulative Distribution Function
x <- seq(0,50,by=1)
y <- pbinom(x,50,0.5)
plot(x,y)

50 - Number of Trials
0.5 - Probability of success for each trial

Random Number Generation
x <- seq(0,50,by=1)
y <- rbinom(x,50,0.5)
plot(x,y)

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)

September 18, 2015

Maths - Basics







Good Read
Link1

Happy Learning!!!

Class Notes - Information Retrieval

Information Retrieval Notes
  • Document Corpus (Collection of links / indexed web structures)
Examples of Information Retrieval
  • Identifying a song from a user's humming is classified as an IR problem
  • Multimedia IR (musics, video, analysing music videos)
  • Photo Search (Visual IR)
Information Retrieval Application Areas
  • Text Information Retrieval
  • Web Search
  • Social Media Search
  • Micro Blogs
  • Twitter Blogs
Boolean Information Retrieval
  • Simplest model
  • Restricted Queries
  • queries are boolean expressions
Inverted Index
  • For each item we have a list
  • Like index of a book (Topic, Pagenumber), Close to glossary
  • Document, Tokenize the text
  • Inside document order tokens, heuristics to combine multiple tokens for index construction
  • Document Frequency - Number of documents in which the term appears
Challenges
  • Ordering (Right to Left)
  • Proximity Search (leveraging the context)
  • Encoding 
  • Normalization and keyword detection based on locale
  • Accents, patterns
  • Stemming (Chopping end of words to obtain root word)
  • Porter algorithm for stemming
  • Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
  • Skip pointers(Find elements common between both lists, increment until they match)
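
A minimal sketch of an inverted index and a Boolean AND query answered by merging two sorted postings lists (the plain merge, without skip pointers). The four short documents are made up, and whitespace splitting stands in for real tokenization and normalization.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to a sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller id
        else:
            j += 1
    return answer

# Made-up documents for illustration.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious",
    3: "Brutus and Caesar spoke",
    4: "Calpurnia warned Caesar",
}
index = build_inverted_index(docs)
brutus_and_caesar = intersect(index["brutus"], index["caesar"])
print(brutus_and_caesar)
```

Skip pointers would let the merge jump ahead in the longer list instead of advancing one posting at a time.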
Happy Learning!!!

September 13, 2015

Central Limit Theorem


Normal Distribution



Standard Normal Distribution - Mean = 0, Variance = 1

The distribution of a sum or mean approaches a normal distribution as the number of variables grows.

"As n increases, the distribution of sample mean approaches normal distribution"

Central Limit Theorem - The sum (or mean) of a large number of independent random variables is approximately normally distributed, which is why many measurable "random" variables in the real world look roughly normal.
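
A small simulation sketch of the idea, using rolls of a fair die as the population (an assumption for this example; the sample size 30 and the 2000 repetitions are arbitrary choices). The sample means should centre on the population mean 3.5 with spread close to Sigma / sqrt(n).

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean(sample_size):
    """Mean of sample_size rolls of a fair six-sided die."""
    return statistics.mean([rng.randint(1, 6) for _ in range(sample_size)])

sample_size = 30
means = [sample_mean(sample_size) for _ in range(2000)]

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)

population_sd = (35 / 12) ** 0.5               # standard deviation of one die roll
expected_sd = population_sd / sample_size ** 0.5  # standard error of the mean

print(mean_of_means, sd_of_means, expected_sd)
```

Plotting a histogram of means would show the familiar bell shape even though a single die roll is uniformly distributed.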

Good Link

"Sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger"

"Average of your sample means will be the population mean"




Happy Learning!!!

September 09, 2015

Linear Algebra Playlist

Linear Algebra and Calculus basics Playlist bookmarked based on reference from my colleague

Pauls Online Calculus Notes

MIT Linear Algebra Playlist



Linear Algebra


Calculus



Linear Regression Basics

To understand Linear Regression, a revision of the basics of slope and correlation was useful.

Slope Revision

Finding the slope of a line from its graph: Slope of a line


Slope = Change in Y / Change in X
Slope is constant for a line

Simple Linear Regression

  • One explanatory variable - Simple Regression
  • More than one explanatory variable - Multiple Regression
  • X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
  • Good fitting line (reasonable for predicting the relationship) - Measure of strength of relationship (Correlation)
  • Correlation - Measures how strongly A & B, observed at the same time, vary together
  • Method of Least Squares to estimate B0 and B1
  • Residual = Observed - Predicted value (Above Line +ve, Below Line -ve)
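
A sketch of the least squares estimates for simple linear regression, B1 = Sxy / Sxx and B0 = mean(Y) - B1 * mean(X), followed by the residuals. The data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept (b0) and slope (b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Made-up observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
b0, b1 = fit_line(xs, ys)

# Residual = observed - predicted (positive above the line, negative below).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, residuals)
```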
Reference Videos

September 07, 2015

Covariance and Correlation - Random Variable - Probability


This video was useful to understand the covariance and correlation relationship


Happy Learning!!!

Working with R - InterQuartile Range

Concept - IQR - InterQuartile Range

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile
  • Median - Arrange data from lowest to highest
  • On Even dataset - Average of two most middle numbers
  • On Odd dataset - Single Number that is halfway into the set
Dataset - 5,6,12,13,15,18,22,50

Q2 = (13+15)/2 = 14 - Median of Data Value

Q1 = (6+12)/2 = 9 - Median Before Q2

Q3 = (18+22)/2 = 20 - Median After Q2

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset
  • Minimum Value - 5
  • Q1 - 9
  • Q2 - 14
  • Q3 - 20
  • Maximum Value - 50
This is the mathematical concept. This is used for finding outliers.

Outlier - A value much larger or smaller than the other values in the data set. IQR is obtained by subtracting the first quartile from the third quartile.

Finding Outliers
1. Any value < Q1-1.5(IQR) or > Q3+1.5(IQR) is an outlier
2. Any Value < (9-1.5(11)) = -7.5
    Any Value > 20+1.5(11) = 20+16.5 = 36.5
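
The hand calculation can be sketched in Python using the same "median of each half" method. The helper names are made up for this example; statistical packages often interpolate quartiles differently, so their Q1 and Q3 may not match these exactly.

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 where Q1 and Q3 are the medians of the lower and upper halves."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + n % 2:]  # skip the middle value when n is odd
    return median(lower), median(s), median(upper)

dataset = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q2, q3 = quartiles(dataset)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in dataset if v < lower_fence or v > upper_fence]
print(q1, q2, q3, iqr, outliers)
```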

This Video was useful to understand the concept before trying out in R


Computing using R

IQR between 25th percentile and the 75th

dataset <-c( 5,6,12,13,15,18,22,50 )
quantile(x=dataset, probs= c(.25,.75))
IQR(x=dataset)
boxplot(dataset)

Sample Output


Outlier highlighted in circles

Happy Learning!!!

September 06, 2015

Class 3 - Statistics Notes

This was mostly on probability distribution functions. Couple of one liners from session

Conditional Probability Distribution - If the value of one random variable does not impact the conditional distribution of another, the random variables are independent

Variance - Average squared deviation of the values from the mean (how spread out the values are around the mean)

Covariance - If X and Y are two independent variables, covariance is zero
Correlation - Covariance scaled by the two standard deviations; values lie between -1 and 1
Chebyshev Inequality - Bounds the probability in the tails for any random variable with finite mean and variance: P(|X - Mu| >= k*Sigma) <= 1/(k*k)
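
A short sketch of sample covariance and correlation in plain Python, using the n - 1 denominator (an assumption; a population version would divide by n). The input lists are made up.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance scaled by both standard deviations; lies between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up values that tend to rise with xs
print(covariance(xs, ys), correlation(xs, ys))
```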



Related Reads
Happy Learning!!!

September 03, 2015

R Basic Examples

Listed below are couple of basic examples working from R Console

Example 1 - Set and get working directory
Example 2 - Read from Data Files

Example 3 - Count row and columns in data

Example 4 - Functions

Example 5 - Plotting



Happy Learning!!!

August 30, 2015

Video Analytics Class 2

My Notes
  • Linear Filter - Linear combination of neighbours
  • Box filter - All values constant [1's]
  • Correlation - Mask is moved across the image
  • Gradient  - due to surface normal discontinuity, depth discontinuity, illumination discontinuity
  • LOG - Laplace of Gaussian. LOG capable of finding edges
  • Salt and Pepper Image - Image has random black and white pixels (noise)
Basics
  • Represent Image as a Matrix
  • Represent Image as a function
  • Point, local operations, histogram equalization, moving average model
  • Cross Correlation g = H X F
  • Gaussian filter (Removes High frequency, blurring, smoothens image)
  • Symmetric Matrix - When you swap rows and columns it stays the same (aij = aji, for all indices i and j); example link
Convolution Basics
Programmatic Walkthru - link

From link 
From link

From Link
FFT

Convolution Applications
  • Smoother image
  • Gaussian (Point spread function)
  • Different Kinds of filter (Box, Gaussian filter)

Cross Correlation - Assess how similar are two different functions. Compares position by position. 
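
A pure-Python sketch of cross-correlation and convolution on a tiny image stored as a list of lists. It covers only the "valid" region (no padding), and the function names and sample values are made up for illustration; real code would more likely use NumPy or OpenCV.

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image and sum the position-by-position products."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    output = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output

def convolve(image, kernel):
    """Convolution is cross-correlation with the kernel flipped in both directions."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate(image, flipped)

# Box filter: all weights equal, so each output value is the neighbourhood average.
box = [[1 / 9] * 3 for _ in range(3)]
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
smoothed = cross_correlate(image, box)
print(smoothed)
```

For a symmetric kernel such as the box filter, correlation and convolution give the same result; they differ only when the kernel is not symmetric.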
Correlation Walkthru

From Link 

Mathematics concepts to learn
  • Vector Product
  • Eigen Value decomposition
  • First Derivative, Second Derivative
Vertical and horizontal edge detection filters - Sobel, Roberts, Prewitt (vertical, horizontal, diagonal edge detection filters).

Good Read Link
MIT Course Slides link

Related Reads







Happy Learning!!!

August 25, 2015

OpenCV Python Basics

Basic image loading modules

Example #1

import cv2
import numpy as np
from matplotlib import pyplot as plt
#Load Image
source = cv2.imread(r'D:\images\Benz.png')

Ref - Link

Example #2

#Printing width and height of image
import cv2
import numpy as np
from matplotlib import pyplot as plt

#Load Image
source = cv2.imread(r'D:\images\Benz.png')
# shape is (rows, columns, channels) for a colour image
print(source.shape)

#Output Number of Rows / Columns
rowCount = source.shape[0]
columnCount = source.shape[1]
print(rowCount, columnCount)

Ref - Link

Example #3
#Drawing Histogram

import cv2
import numpy as np
from matplotlib import pyplot as plt

#Load Image
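source = cv2.imread(r'D:\images\Benz.png')

# Arguments to calcHist: images, channels, mask, histSize, ranges
# channels - [0] for a grayscale image; for a colour image [0], [1], [2] are blue, green, red (OpenCV loads BGR)
# mask - None because the full image region is needed
# histSize - number of bins
# ranges - pixel values from 0 up to (but not including) 256
hist = cv2.calcHist([source], [0], None, [256], [0,256])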
plt.plot(hist)
plt.show()

Happy Learning!!!

August 24, 2015

SettingUp OpenCV and Python

Reference Steps - Link

1. Download and Install python from link. Install with Default Settings
2. Download and Install MatPlot Lib (Link in reference steps are fine)
3. Download and Install OpenCV executable. Extract it to C:\OpenCV location
4. Now All Installations are located in C:\
5. Open Python IDLE from program files. In win7 run-it-as admin to open Program Files ->Python IDLE
6. Copy File from below location
      From - C:\OpenCV\opencv\build\python\2.7\x86\cv2.pyd
      To - C:\Python27\Lib\site-packages
7. Got the Error Link
8. Download and install numpy from link (the numpy 1.7 link provided in the reference steps is incorrect; you need 1.9.2)
9. Validating installation steps

Happy Learning!!!