"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 22, 2015

R Working on Normal Distributions & Binomial Distributions


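Cumulative distribution function (CDF) - the probability that a random variable takes a value less than or equal to x. For a continuous variable the CDF is the integral of the PDF (probability density function) from -∞ up to x; for a discrete variable it is the sum of the probabilities of all values up to x.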

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora
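R names its distribution functions with a one-letter prefix followed by the distribution name (for example dnorm, pnorm, qnorm, rnorm):
  • “d” - returns the height of the probability density function (the probability mass for a discrete distribution)
  • “p” - returns the cumulative distribution function
  • “q” - returns the inverse cumulative distribution function (quantiles)
  • “r” - returns randomly generated numbers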
Reference - Link

R Examples
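pnorm(700,500,100)
Mean - 500
Standard deviation - 100 (the third argument of pnorm is the standard deviation, not the variance)

pnorm returns about 0.977, so a score of 700 is better than roughly 97.7% of other scores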
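A small sketch of the four prefixes applied to the score example, assuming scores are normal with mean 500 and standard deviation 100. The variable names and the seed are my own choices for illustration.

```r
# Assumed model: scores ~ Normal(mean = 500, sd = 100)
mu <- 500
sigma <- 100

p_below <- pnorm(700, mean = mu, sd = sigma)   # P(score <= 700)
z <- (700 - mu) / sigma                        # standardised score, here 2
p_std <- pnorm(z)                              # same probability on the standard normal

score_back <- qnorm(p_below, mean = mu, sd = sigma)  # "q" inverts "p", giving 700 again
height <- dnorm(700, mean = mu, sd = sigma)          # density height, not a probability

set.seed(1)                                    # arbitrary seed so the draws repeat
draws <- rnorm(5, mean = mu, sd = sigma)       # five simulated scores

round(p_below, 4)                              # 0.9772
```

Passing sd = 10 (the square root of 100) would be the right call only if 100 really were the variance.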

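v <- c(-1,1,2)
dnorm(v)                  # standard normal density at each value of v
plot(v,dnorm(v))          # density heights
plot(v,pnorm(v))          # cumulative probabilities
dnorm(0,mean(v),sd(v))    # density at 0 for a normal with the mean and sd of v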

Link Ref

Normal distribution is defined by the following probability density function, where $\mu$ is the population mean and $\sigma^2$ is the variance.

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

Chi-squared Distribution - If $X_1, X_2, \ldots, X_m$ are m independent random variables having the standard normal distribution, then the following quantity follows a Chi-Squared distribution with m degrees of freedom

$$ V = X_1^2 + X_2^2 + \cdots + X_m^2 $$
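The chi-squared definition can be checked by simulation: square m standard normal draws, add them up, and compare with R's chisq functions. This is only a sketch; m, the number of simulations and the seed are arbitrary choices of mine.

```r
set.seed(42)          # arbitrary seed
m <- 3                # degrees of freedom (assumed for illustration)
n_sims <- 10000       # number of simulated sums

# Each row holds m independent standard normal draws
z <- matrix(rnorm(n_sims * m), nrow = n_sims, ncol = m)
v <- rowSums(z^2)     # sum of squares, one value per row

mean(v)               # should be close to m
var(v)                # should be close to 2 * m

emp_median <- median(v)              # median of the simulated sums
theo_median <- qchisq(0.5, df = m)   # median of the chi-squared distribution
c(emp_median, theo_median)
```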

Binomial Distribution - the binomial distribution is a discrete probability distribution. It describes the number of successes in n independent trials of an experiment, where each trial has the same probability of success.

Two Extra Parameters - number of trials and the probability of success for a single trial

Probability mass function (dbinom)
x <- seq(0,50,by=1)
y <- dbinom(x,50,0.2)
plot(x,y)

50 - Number of Trials
0.2 - Probability of success for each trial

Cumulative distribution function (pbinom)
x <- seq(0,50,by=1)
y <- pbinom(x,50,0.5)
plot(x,y)

50 - Number of Trials
0.5 - Probability of success for each trial

Random number generation (rbinom)
x <- seq(0,50,by=1)
y <- rbinom(length(x),50,0.5)   # one random draw per element of x
plot(x,y)
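The binomial functions relate to each other in ways that are easy to check numerically. A minimal sketch using the 50 trials and 0.2 success probability from the first example; the seed and the number of draws are my own choices.

```r
n <- 50        # number of trials
p <- 0.2       # probability of success on each trial
x <- 0:n       # every possible number of successes

pmf <- dbinom(x, size = n, prob = p)   # P(X = x)
cdf <- pbinom(x, size = n, prob = p)   # P(X <= x)

sum(pmf)                               # the probabilities add up to 1
isTRUE(all.equal(cumsum(pmf), cdf))    # the CDF is the running sum of the PMF
sum(x * pmf)                           # expected value, equal to n * p

set.seed(7)                            # arbitrary seed
draws <- rbinom(1000, size = n, prob = p)
mean(draws)                            # sample mean, near n * p
```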

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)

September 18, 2015

Maths - Basics







Good Read
Link1

Happy Learning!!!

Class Notes - Information Retrieval

Information Retrieval Notes
  • Document Corpus (Collection of links / indexed web structures)
Examples of Information Retrieval
  • When a user hums a song and the system identifies the song from the humming, that is classified as an IR problem
  • Multimedia IR (music, video, analysing music videos)
  • Photo Search (Visual IR)
Information Retrieval Application Areas
  • Text Information Retrieval
  • Web Search
  • Social Media Search
  • Micro Blogs
  • Twitter Blogs
Boolean Information Retrieval
  • Simplest model
  • Restricted Queries
  • queries are boolean expressions
Inverted Index
  • For each item we have a list
  • Like index of a book (Topic, Pagenumber), Close to glossary
  • Document, Tokenize the text
  • Inside document order tokens, heuristics to combine multiple tokens for index construction
  • Document Frequency - the number of documents in which a term appears (how many times a term appears inside one document is its term frequency)
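A toy inverted index, as a rough sketch of the bullets above. The three documents, the tokenizer and the function names (tokenize, build_index, and_query) are made up for illustration; a real system would add stemming, stop words and the other steps from the class.

```r
# Hypothetical three-document corpus
docs <- c(
  d1 = "new home sales top forecasts",
  d2 = "home sales rise in july",
  d3 = "increase in home sales in july"
)

# Tokenize: lower-case the text and split on anything that is not a letter
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Build the index: each term maps to the sorted ids of documents containing it
build_index <- function(docs) {
  index <- list()
  for (id in names(docs)) {
    for (term in unique(tokenize(docs[[id]]))) {
      index[[term]] <- c(index[[term]], id)
    }
  }
  lapply(index, sort)
}

index <- build_index(docs)
doc_freq <- sapply(index, length)   # document frequency of every term

index[["july"]]                     # "d2" "d3"
doc_freq[["home"]]                  # 3

# Boolean AND query: documents that contain both terms
and_query <- function(index, a, b) intersect(index[[a]], index[[b]])
and_query(index, "sales", "july")   # "d2" "d3"
```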
Challenges
  • Ordering (Right to Left)
  • Proximity Search (leveraging the context)
  • Encoding 
  • Normalization and keyword detection based on locale
  • Accents, patterns
  • Stemming (Chopping end of words to obtain root word)
  • Porter algorithm for stemming
  • Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
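  • Skip pointers (when intersecting two sorted postings lists, advance through both lists until the entries match; skip pointers let the search jump over runs of entries that cannot match)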
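The skip-pointer bullet can be sketched as a merge of two sorted postings lists. This is a simplified version under my own assumptions: document ids are strictly increasing integers, and every position is allowed to jump ahead by roughly the square root of the list length, whereas a real index stores skip pointers only at selected positions. The function names are hypothetical.

```r
# Move forward in a postings list: take the skip if it does not overshoot the target
advance <- function(postings, pos, target, skip) {
  jump <- pos + skip
  if (jump <= length(postings) && postings[jump] <= target) jump else pos + 1
}

# Intersect two sorted postings lists (strictly increasing document ids)
intersect_postings <- function(a, b) {
  skip_a <- max(1, floor(sqrt(length(a))))
  skip_b <- max(1, floor(sqrt(length(b))))
  i <- 1
  j <- 1
  result <- c()
  while (i <= length(a) && j <= length(b)) {
    if (a[i] == b[j]) {
      result <- c(result, a[i])     # common document found
      i <- i + 1
      j <- j + 1
    } else if (a[i] < b[j]) {
      i <- advance(a, i, b[j], skip_a)
    } else {
      j <- advance(b, j, a[i], skip_b)
    }
  }
  result
}

a <- c(2, 4, 8, 16, 32, 64, 128)
b <- c(1, 2, 3, 5, 8, 13, 21, 34, 64)
intersect_postings(a, b)            # 2 8 64
```

A skip is taken only when the entry it lands on is still less than or equal to the value being searched for, so no match can be jumped over.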
Happy Learning!!!

September 13, 2015

Central Limit Theorem


Normal Distribution



Standard Normal Distribution - Mean = 0, Variance = 1

The distribution of a sum or average approaches a normal distribution as the number of independent variables being combined grows.

"As n increases, the distribution of sample mean approaches normal distribution"

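Central Limit Theorem - The sum (or mean) of a large number of independent random variables with finite variance is approximately normally distributed, whatever the distribution of the individual variables. This is why many measured quantities in the real world look roughly normal; it does not mean that every random variable is normal.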
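A quick simulation makes the theorem visible. The sketch below assumes an exponential population (clearly not normal, with mean 1 and standard deviation 1); the sample sizes, number of repetitions and seed are arbitrary choices of mine.

```r
set.seed(123)   # arbitrary seed
rate <- 1       # exponential population: mean 1 / rate = 1, sd = 1

# Draw `reps` samples of size n and return the mean of each sample
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = rate)))
}

means_small <- sample_means(2)    # still visibly skewed
means_large <- sample_means(50)   # much closer to a bell shape

mean(means_large)                 # near the population mean, 1
sd(means_large)                   # near 1 / sqrt(50)

hist(means_small, breaks = 40, main = "Sample means, n = 2")
hist(means_large, breaks = 40, main = "Sample means, n = 50")
```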

Good Link

"Sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger"

"Average of your sample means will be the population mean"




Happy Learning!!!

September 09, 2015

Linear Algebra Playlist

Linear algebra and calculus basics playlists, bookmarked on a colleague's recommendation

Pauls Online Calculus Notes

MIT Linear Algebra Playlist



Linear Algebra


Calculus



Linear Regression Basics

To understand linear regression, revising the basics of slope and correlation was useful.

Slope Revision

Finding the slope of a line from its graph: Slope of a line


Slope = Change in Y / Change in X
Slope is constant for a line

Simple Linear Regression

  • One explanatory variable - simple regression
  • More than one explanatory variable - multiple regression
  • X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
  • Good fitting line (reasonable for predicting the relationship) - the strength of the relationship is measured by correlation
  • Correlation - computed from pairs of values of A and B observed at the same time
  • Method of least squares is used to estimate B0 (intercept) and B1 (slope)
  • Residual = Observed - Predicted value (points above the line +ve, below the line -ve)
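The least squares estimates and the residuals can be worked out by hand and compared with lm. A minimal sketch on five made-up points; the data values are illustrative only.

```r
# Made-up data: x is the explanatory variable, y the response
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)           # slope
b0 <- mean(y) - b1 * mean(x)       # intercept

fit <- lm(y ~ x)                   # the same fit using R's built-in function
coef(fit)                          # intercept and slope, matching b0 and b1

predicted <- b0 + b1 * x
residuals_by_hand <- y - predicted # observed minus predicted

cor(x, y)                          # strength of the linear relationship
```

With an intercept in the model the residuals sum to zero, which is a handy sanity check on a hand calculation.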
Reference Videos

September 07, 2015

Covariance and Correlation - Random Variable - Probability


This video was useful to understand the covariance and correlation relationship


Happy Learning!!!

Working with R - InterQuartile Range

Concept - IQR - InterQuartile Range

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile
  • Median - Arrange data from lowest to highest
  • On Even dataset - Average of two most middle numbers
  • On Odd dataset - Single Number that is halfway into the set
Dataset - 5,6,12,13,15,18,22,50

Q2 = (13+15)/2 = 14 - Median of Data Value

Q1 = (6+12)/2 = 9 - Median Before Q2

Q3 = (18+22)/2 = 20 - Median After Q2

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset
  • Minimum Value - 5
  • Q1 - 9
  • Q2 - 14
  • Q3 - 20
  • Maximum Value - 50
This is the mathematical concept. This is used for finding outliers.

Outlier - A value much larger or smaller than the other values in the data set. The IQR is obtained by subtracting the first quartile from the third quartile.

Finding Outliers
1. Any value < Q1-1.5(IQR) or > Q3+1.5(IQR) is an outlier
2. Any Value < 9-1.5(11) = 9-16.5 = -7.5
    Any Value > 20+1.5(11) = 20+16.5 = 36.5
3. In the dataset above only 50 falls outside the range -7.5 to 36.5, so 50 is the outlier

This Video was useful to understand the concept before trying out in R


Computing using R

The IQR is the distance between the 25th percentile and the 75th percentile

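dataset <- c(5,6,12,13,15,18,22,50)
quantile(x=dataset, probs=c(.25,.75))   # R's default interpolation gives 10.5 and 19, not the hand-computed 9 and 20
IQR(x=dataset)                          # 8.5 with the same default
boxplot(dataset)                        # draws 50 as an outlier point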
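The hand calculation (Q1 = 9, Q3 = 20) uses the median-of-each-half rule, which corresponds to the hinges R reports through fivenum and uses for boxplots, while quantile and IQR interpolate by default. The sketch below reproduces both on the same dataset; the variable names are my own.

```r
dataset <- c(5, 6, 12, 13, 15, 18, 22, 50)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(dataset)            # 5 9 14 20 50
q1 <- five[2]
q3 <- five[4]
iqr_hand <- q3 - q1                 # 11, as in the hand calculation

# Outlier fences at 1.5 * IQR beyond the hinges
lower_fence <- q1 - 1.5 * iqr_hand  # -7.5
upper_fence <- q3 + 1.5 * iqr_hand  # 36.5
outliers <- dataset[dataset < lower_fence | dataset > upper_fence]
outliers                            # 50

# The default quantile method interpolates, so its quartiles differ
quantile(dataset, probs = c(0.25, 0.75))   # 10.5 and 19
IQR(dataset)                               # 8.5

# boxplot() relies on the same statistics as boxplot.stats()
boxplot.stats(dataset)$out          # 50
```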

Sample Output


Outlier highlighted in circles

Happy Learning!!!

September 06, 2015

Class 3 - Statistics Notes

This was mostly on probability distribution functions. A couple of one-liners from the session

Conditional Probability Distribution - If knowing the value of one random variable does not change the distribution of another (the conditional distribution equals the unconditional one), the random variables are independent

Variance - How far values spread around the mean, measured as the average squared deviation from the mean

Covariance - If X and Y are two independent variables, the covariance is zero (zero covariance does not by itself imply independence)
Correlation - Values lie between -1 and 1
Chebyshev Inequality - Bounds the probability in the tails for any random variable with finite mean $\mu$ and standard deviation $\sigma$: $P(|X - \mu| \ge k\sigma) \le 1/k^2$. The bound comes from restricting the variance integral to the smaller tail region.
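Chebyshev's bound says that at most 1/k^2 of the probability lies k or more standard deviations from the mean, for any distribution with finite variance. A rough empirical check on a skewed sample; the exponential distribution, the values of k and the seed are my own choices.

```r
set.seed(2015)                      # arbitrary seed
x <- rexp(100000, rate = 1)         # a skewed, clearly non-normal sample

mu <- mean(x)
sigma <- sd(x)
k <- c(1.5, 2, 3)

# Observed share of values at least k standard deviations from the mean
tail_prob <- sapply(k, function(kk) mean(abs(x - mu) >= kk * sigma))
bound <- 1 / k^2                    # Chebyshev upper bound for each k

data.frame(k, tail_prob, bound)     # tail_prob stays below bound in every row
```

The bound is loose for most distributions, which is the price of it holding for all of them.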



Related Reads
Happy Learning!!!

September 03, 2015

R Basic Examples

Listed below are a couple of basic examples run from the R console

Example 1 - Set and get working directory
Example 2 - Read from Data Files

Example 3 - Count row and columns in data

Example 4 - Functions

Example 5 - Plotting
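The five examples might look roughly like the sketch below. It is not the original code: the file name scores.csv, the column names and the helper above_mean are made up, and the working directory is a temporary one so nothing else is touched.

```r
# Example 1 - set and get the working directory
old_dir <- getwd()                  # remember where we started
setwd(tempdir())                    # switch to a temporary directory
getwd()

# Example 2 - read from a data file (written first so the example is self-contained)
write.csv(data.frame(id = 1:5, score = c(52, 61, 70, 48, 66)),
          "scores.csv", row.names = FALSE)
scores <- read.csv("scores.csv")

# Example 3 - count rows and columns
nrow(scores)                        # 5
ncol(scores)                        # 2
dim(scores)                         # 5 2

# Example 4 - a small function: values above the mean
above_mean <- function(values) values[values > mean(values)]
above_mean(scores$score)            # 61 70 66

# Example 5 - plotting
plot(scores$id, scores$score, type = "b", xlab = "id", ylab = "score")

setwd(old_dir)                      # restore the original working directory
```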



Happy Learning!!!