"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 22, 2015

R Working on Normal Distributions & Binomial Distributions

Cumulative distribution function (CDF) - the probability that the variable takes a value less than or equal to x, i.e. the accumulated probability up to x. For a continuous variable, the CDF is the integral of the PDF (probability density function).

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora
R distribution functions use a one-letter prefix (for example dnorm, pnorm, qnorm, rnorm):
  • “d” - returns the height of the probability density function
  • “p” - returns the cumulative distribution function
  • “q” - returns the inverse cumulative distribution function (quantiles)
  • “r” - returns randomly generated numbers
Reference - Link

R Examples
Mean - 500
Standard deviation - 100

A score of 700 is better than about 97.7% of other scores: pnorm(700, mean = 500, sd = 100) returns 0.977.
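
The 97% figure can be checked outside R as well. A minimal Python sketch, assuming the 100 above is the standard deviation (a variance of 100 would put 700 twenty standard deviations above the mean). `NormalDist` comes from Python's standard library; its `cdf` and `inv_cdf` methods play the role of R's "p" and "q" prefixes:

```python
from statistics import NormalDist

# Assumption: scores are normal with mean 500 and standard deviation 100.
scores = NormalDist(mu=500, sigma=100)

# Like R's pnorm: probability that a score is at or below 700.
share_below_700 = scores.cdf(700)

# Like R's qnorm: the score that sits at that cumulative probability.
score_at_share = scores.inv_cdf(share_below_700)

print(round(share_below_700, 4))
print(round(score_at_share, 1))
```

The second value should land back on 700, which is a quick way to see that "q" is the inverse of "p".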

v <- c(-1,1,2)

Link Ref

Normal distribution is defined by the following probability density function, where μ is the population mean and σ² is the variance:

f(x) = 1 / (σ √(2π)) · exp(−(x − μ)² / (2σ²))

Chi-squared Distribution - If X1,X2,…,Xm are m independent random variables having the standard normal distribution, then the following quantity follows a Chi-Squared distribution with m degrees of freedom:

V = X1² + X2² + … + Xm²

Binomial Distribution - a discrete probability distribution. It describes the number of successes in n independent trials of an experiment, where each trial has the same probability of success.

Two Extra Parameters - number of trials and the probability of success for a single trial

Probability function (probability of exactly x successes)
x <- seq(0,50,by=1)
y <- dbinom(x,50,0.2)

50 - Number of Trials
0.2 - Probability of success for each trial

Cumulative distribution function (probability of at most x successes)
x <- seq(0,50,by=1)
y <- pbinom(x,50,0.5)

50 - Number of Trials
0.5 - Probability of success for each trial
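
What dbinom and pbinom compute can be written out by hand. A rough Python sketch using the same parameters as above; the helper names `binom_pmf` and `binom_cdf` are my own and not part of any library:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (like R's dbinom)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Probability of at most k successes in n trials (like R's pbinom)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n_trials = 50
pmf = [binom_pmf(k, n_trials, 0.2) for k in range(n_trials + 1)]
cdf = [binom_cdf(k, n_trials, 0.5) for k in range(n_trials + 1)]

# Most likely number of successes when p = 0.2
most_likely = max(range(n_trials + 1), key=lambda k: pmf[k])
print(most_likely)
# Probability of at most 25 successes when p = 0.5
print(round(cdf[25], 3))
```

The probabilities in `pmf` add up to 1, and `cdf` climbs from near 0 to 1 because each entry is a running total of the probability function.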

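Random number generation
x <- seq(0,50,by=1)
# rbinom takes the number of draws as its first argument; a vector there only contributes its length (51)
y <- rbinom(length(x),50,0.5)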

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)

September 18, 2015

Maths - Basics

Good Read

Happy Learning!!!

Class Notes - Information Retrieval

Information Retrieval Notes
  • Document Corpus (Collection of links / indexed web structures)
Examples of Information Retrieval
  • Identifying a song from a user's humming is classified as an IR problem
  • Multimedia IR (music, video, analysing music videos)
  • Photo Search (Visual IR)
Information Retrieval Application Areas
  • Text Information Retrieval
  • Web Search
  • Social Media Search
  • Micro Blogs
  • Twitter Blogs
Boolean Information Retrieval
  • Simplest model
  • Restricted Queries
  • queries are boolean expressions
Inverted Index
  • For each item we have a list
  • Like index of a book (Topic, Pagenumber), Close to glossary
  • Document, Tokenize the text
  • Inside document order tokens, heuristics to combine multiple tokens for index construction
  • Document Frequency - Number of documents in which a term appears
  • Ordering (Right to Left)
  • Proximity Search (leveraging the context)
  • Encoding 
  • Normalization and keyword detection based on locale
  • Accents, patterns
  • Stemming (Chopping end of words to obtain root word)
  • Porter algorithm for stemming
  • Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
  • Skip pointers (when finding elements common to two posting lists, a pointer jumps ahead over entries that cannot match instead of incrementing one step at a time)
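
The inverted index and Boolean AND query from the notes above can be sketched in a few lines of Python. This is only an illustration: the three-document corpus and the stopword list are made up, and stemming and skip pointers are left out to keep it short.

```python
from collections import defaultdict

# Hypothetical three-document corpus, for illustration only.
documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "A quick dog barks",
}
STOPWORDS = {"the", "a"}

def tokenize(text):
    """Lower-case, split on whitespace and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to a sorted list of the document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(postings_a, postings_b):
    """Boolean AND: walk two sorted posting lists, advancing the smaller side."""
    i, j, common = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            common.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return common

index = build_inverted_index(documents)
document_frequency = {term: len(ids) for term, ids in index.items()}

print(intersect(index["quick"], index["dog"]))  # documents matching: quick AND dog
print(document_frequency["brown"])
```

Keeping each posting list sorted is what makes the single-pass intersection possible, and it is also what skip pointers rely on.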
Happy Learning!!!

September 13, 2015

Central Limit Theorem

Normal Distribution

Standard Normal Distribution - Mean = 0, Variance = 1

As the number of variables being summed or averaged grows, the distribution of the result approaches a normal distribution.

"As n increases, the distribution of sample mean approaches normal distribution"

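Central Limit Theorem - The sum or mean of many independent random variables with finite variance is approximately normally distributed, which is why many measurable "random" quantities in the real world look roughly normal.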
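
A small simulation makes the theorem easier to believe. The Python sketch below assumes a uniform(0, 1) population (clearly not normal); the sample size of 30 and the 2000 repetitions are arbitrary choices of mine:

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

SAMPLE_SIZE = 30
NUM_SAMPLES = 2000

def sample_mean(size):
    """Mean of `size` draws from a uniform(0, 1) population."""
    return statistics.fmean(random.random() for _ in range(size))

sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

# Population mean of uniform(0, 1) is 0.5 and its variance is 1/12.
mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)
expected_spread = (1 / 12 / SAMPLE_SIZE) ** 0.5

# For a normal distribution about 68% of values fall within one standard deviation.
within_one = sum(
    abs(m - mean_of_means) <= spread_of_means for m in sample_means
) / NUM_SAMPLES

print(round(mean_of_means, 3), round(spread_of_means, 3), round(expected_spread, 3))
print(round(within_one, 3))
```

The average of the sample means should sit close to the population mean of 0.5, and the share within one standard deviation should be close to the normal distribution's 68%.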

Good Link

"Sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger"

"Average of your sample means will be the population mean"

Happy Learning!!!

September 09, 2015

Linear Algebra Playlist

Linear Algebra and Calculus basics Playlist bookmarked based on reference from my colleague

Pauls Online Calculus Notes

MIT Linear Algebra Playlist

Linear Algebra


Linear Regression Basics

To understand linear regression, revising the basics of slope and correlation was useful.

Slope Revision

Finding the slope of a line from its graph: Slope of a line

Slope = Change in Y / Change in X
Slope is constant for a line

Simple Linear Regression

  • One explanatory variable - simple regression
  • More than one explanatory variable - multiple regression
  • X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
  • Good fitting line (reasonable for predicting the relationship) - the strength of the relationship is measured by correlation
  • Correlation - how strongly A & B, observed at the same time, vary together
  • Method of Least Squares to estimate B0 (intercept) and B1 (slope)
  • Residual = Observed - Predicted value (Above Line +ve, Below Line -ve)
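
The least squares estimates and residuals from the list above can be worked through by hand. A minimal Python sketch, where the five data points are hypothetical and the function name `fit_line` is my own:

```python
def fit_line(xs, ys):
    """Least-squares estimates of B0 (intercept) and B1 (slope) for y = B0 + B1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)

# Residual = observed - predicted (positive above the line, negative below it)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(round(b0, 2), round(b1, 2))
print([round(r, 2) for r in residuals])
```

The residuals of a least squares line add up to zero, so the points above and below the line balance out.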
Reference Videos

September 07, 2015

Covariance and Correlation - Random Variable - Probability

This video was useful to understand the covariance and correlation relationship

Happy Learning!!!

Working with R - InterQuartile Range

Concept - IQR - InterQuartile Range

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile
  • Median - Arrange data from lowest to highest and take the middle value
  • On Even dataset - Average of the two middle numbers
  • On Odd dataset - Single Number that is halfway into the set
Dataset - 5,6,12,13,15,18,22,50

Q2 = (13+15)/2 = 14 - Median of Data Value

Q1 = (6+12)/2 = 9 - Median Before Q2

Q3 = (18+22)/2 = 20 - Median After Q2

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset
  • Minimum Value - 5
  • Q1 - 9
  • Q2 - 14
  • Q3 - 20
  • Maximum Value - 50
This is the mathematical concept. This is used for finding outliers.

Outlier - Much larger or smaller than other values in data set. IQR is obtained by subtracting the first quartile from the third quartile.

Finding Outliers
1. Any value < Q1 - 1.5 × IQR or > Q3 + 1.5 × IQR is an outlier
2. Any value < 9 - 1.5 × 11 = 9 - 16.5 = -7.5
    Any value > 20 + 1.5 × 11 = 20 + 16.5 = 36.5
3. In the dataset above only 50 is greater than 36.5, so 50 is the outlier
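
The same calculation as a short Python sketch. It follows the median-of-halves method used in the worked example above; the function names are my own:

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                   # values before the median
    upper = data[half + len(data) % 2:]   # values after the median
    return median(lower), median(data), median(upper)

def find_outliers(values):
    """Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

dataset = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q2, q3 = quartiles(dataset)

print(q1, q2, q3, q3 - q1)
print(find_outliers(dataset))
```

For an odd-sized dataset this version leaves the median itself out of both halves, which is one common convention; other tools may interpolate instead.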

This Video was useful to understand the concept before trying out in R

Computing using R

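The IQR is the distance between the 25th and 75th percentiles. quantile() interpolates by default (type 7) and returns 10.5 and 19 for this dataset, slightly different from the median-of-halves values above; IQR(dataset) returns the difference directly.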

dataset <-c( 5,6,12,13,15,18,22,50 )
quantile(x=dataset, probs= c(.25,.75))

Sample Output

Outlier highlighted in circles

Happy Learning!!!

September 06, 2015

Class 3 - Statistics Notes

This was mostly on probability distribution functions. Couple of one liners from session

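Conditional Probability Distribution - If the value of one random variable does not affect the distribution of another (the conditional distribution equals the unconditional one), the random variables are independent

Variance - Average squared deviation of the values from the mean (how spread out the values are below and above the mean)

Covariance - If X and Y are two independent variables, Covariance is zero (the converse does not always hold)
Correlation - Values between -1 and 1
Probability "Chebyshev Inequality"
Chebyshev Inequality - Bounds the probability in the tails for any random variable with finite variance: P(|X - μ| ≥ kσ) ≤ 1/k². The proof restricts the variance integral to the tail region.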
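
To make variance, covariance and correlation concrete, here is a small Python sketch. The paired observations (`hours`, `marks`) are invented for illustration, and the population (divide by n) versions are used:

```python
from statistics import fmean, pstdev

def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

# Hypothetical paired observations, for illustration only.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

# Variance is the covariance of a variable with itself.
variance_of_hours = covariance(hours, hours)

print(variance_of_hours)
print(round(covariance(hours, marks), 2), round(correlation(hours, marks), 3))
```

Covariance depends on the units of both variables, while correlation is unit-free, which is why correlation is easier to compare across datasets.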

Related Reads
Happy Learning!!!

September 03, 2015

R Basic Examples

Listed below are a couple of basic examples run from the R console

Example 1 - Set and get working directory
Example 2 - Read from Data Files

Example 3 - Count row and columns in data

Example 4 - Functions

Example 5 - Plotting

Happy Learning!!!

August 30, 2015

Video Analytics Class 2

My Notes
  • Linear Filter - Linear combination of neighbours
  • Box filter - All values constant [1's]
  • Correlation - mask (kernel) is moved across the image
  • Gradient - due to surface normal discontinuity, depth discontinuity, illumination discontinuity
  • LoG - Laplacian of Gaussian. LoG is capable of finding edges
  • Salt-and-pepper noise - image has random black and white pixels
  • Represent Image as a Matrix
  • Represent Image as a function
  • Point, local operations, histogram equalization, moving average model
  • Cross Correlation g = H ⊗ F
  • Gaussian filter (removes high frequencies, blurs and smoothens the image)
  • Symmetric Matrix (when you swap rows and columns it appears the same: aij = aji for all indices i and j) example link
Convolution Basics
Programmatic Walkthru - link


Convolution Applications
  • Smoother image
  • Gaussian (Point spread function)
  • Different Kinds of filter (Box, Gaussian filter)

Cross Correlation - Assess how similar are two different functions. Compares position by position. 
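
The difference between cross-correlation and convolution is easiest to see with a tiny example. A plain Python sketch using nested lists; the 5x5 image and the one-sided kernel are made up, and only the "valid" region (no padding) is computed:

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image ('valid' region only) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                kernel[u][v] * image[r + u][c + v]
                for u in range(k_rows)
                for v in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

def convolve(image, kernel):
    """Convolution is cross-correlation with the kernel flipped in both directions."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate(image, flipped)

# Hypothetical 5x5 grey-level image with a single bright pixel in the centre.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 9

box = [[1 / 9] * 3 for _ in range(3)]           # box filter: all weights equal
one_sided = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]   # asymmetric kernel

print(cross_correlate(image, box))
print(cross_correlate(image, one_sided))
print(convolve(image, one_sided))
```

With a symmetric kernel such as the box filter the two operations give the same result; with the asymmetric kernel the bright pixel ends up on opposite sides of the output.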
Correlation Walkthru


Mathematics concepts to learn
  • Vector Product
  • Eigen Value decomposition
  • First Derivative, Second Derivative
Vertical and horizontal edge detection filters - Sobel, Roberts, Prewitt (Vertical, Horizontal, Diagonal edge detection filters).

Good Read Link
MIT Course Slides link

Related Reads

Happy Learning!!!

August 25, 2015

OpenCV Python Basics

Basic image loading modules

Example #1

import cv2
import numpy as np
from matplotlib import pyplot as plt
#Load Image (raw string keeps the backslashes in the Windows path literal)
source = cv2.imread(r'D:\images\Benz.png')

Ref - Link

Example #2

#Printing width and height of image
import cv2
import numpy as np
from matplotlib import pyplot as plt

#Load Image (raw string keeps the backslashes in the Windows path literal)
source = cv2.imread(r'D:\images\Benz.png')
print(source.shape)

#Output Number of Rows / Columns
rowCount = source.shape[0]
columnCount = source.shape[1]
print(rowCount, columnCount)

Ref - Link

Example #3
#Drawing Histogram

import cv2
import numpy as np
from matplotlib import pyplot as plt

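#Load Image (raw string keeps the backslashes in the Windows path literal)
source = cv2.imread(r'D:\images\Benz.png')

# Image, Channel
# Channels - grayscale image - [0]; colour image - 0,1,2 (OpenCV loads channels in B,G,R order)
# Mask - Supplied None as FULL Region Needed
# histSize - Number of bins
# ranges - 0 to 256 (upper bound exclusive)
hist = cv2.calcHist([source], [0], None, [256], [0,256])

# Draw the histogram
plt.plot(hist)
plt.show()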

Happy Learning!!!

August 24, 2015

SettingUp OpenCV and Python

Reference Steps - Link

1. Download and install Python from link. Install with default settings
2. Download and install Matplotlib (the link in the reference steps is fine)
3. Download and install the OpenCV executable. Extract it to the C:\OpenCV location
4. Now all installations are located in C:\
5. Open Python IDLE from Program Files. In Windows 7, run it as administrator (Program Files -> Python IDLE)
6. Copy File from below location
      From - C:\OpenCV\opencv\build\python\2.7\x86\cv2.pyd
      To - C:\Python27\Lib\site-packages
7. Got the Error Link
8. Download and install numpy from link (the numpy 1.7 link provided in the reference steps is incorrect; you need 1.9.2)
9. Validating installation steps

Happy Learning!!!

August 23, 2015

Video Analytics Concepts

  • Image - Matrix of Intensity Values
  • Each pixel has a byte where you can store information
  • Image can be represented in a matrix
  • Image can be represented as a function
Image Processing Operations
  • Point, Local, Global
  • Point - Take every pixel and perform operation
  • Local - Local Neighbourhood data manipulation. Global is extension of local
Point Operations
  • Image Enhancement - new_Pixel = max-old_Pixel+min
  • Contrast Stretching - Histogram stretching

  • Local - Noise reduction using moving average
  • Weighted sum of neighbours computation
  • Linear Filtering - Cross Correlation, Gaussian Filter
  • Convolution Vs Correlation
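
The point and local operations above are simple enough to try on a tiny matrix. A rough Python sketch with a made-up 3x3 image; the function names are my own, and the moving average only covers pixels that have a full 3x3 neighbourhood:

```python
def negative(image):
    """Point operation from the notes: new_pixel = max - old_pixel + min."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    return [[hi - p + lo for p in row] for row in image]

def stretch_contrast(image, new_min=0, new_max=255):
    """Point operation: linearly map the image's intensity range onto [new_min, new_max]."""
    pixels = [p for row in image for p in row]
    lo, hi = min(pixels), max(pixels)
    scale = (new_max - new_min) / (hi - lo)  # assumes the image is not flat (hi > lo)
    return [[round(new_min + (p - lo) * scale) for p in row] for row in image]

def moving_average(image):
    """Local operation: mean of each full 3x3 neighbourhood (noise reduction)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + u][c + v] for u in range(3) for v in range(3)) / 9
            for c in range(cols - 2)
        ]
        for r in range(rows - 2)
    ]

# Hypothetical low-contrast 3x3 image (intensities between 100 and 180).
image = [
    [100, 110, 120],
    [130, 140, 150],
    [160, 170, 180],
]

print(negative(image))
print(stretch_contrast(image))
print(moving_average(image))
```

A point operation looks at one pixel at a time, while the moving average needs the neighbours, which is the whole difference between the "point" and "local" categories.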
Happy Learning!!!

Statistical Programming Notes - Class 1

  • Probability - Study of randomness and uncertainty
  • Random Experiment - Process whose outcome cannot be predicted with certainty
  • Sample Space - All possible outcomes
  • Event - Subset of Sample Space
  • Probability Value - Expected frequency of occurrence of an outcome
Concept #1
Frequentist View 

P(Event) = Number of Times Expected Event Occurred / Total Number of Trials
P(A) = N(A) / N

Probability P of an uncertain event A, written P(A), is defined by the frequency of that event based on previous observations

More Reads on this topic
Frequentism and Bayesianism: A Practical Introduction

Concept #1.1
Bayesian View - Probability is a degree of belief, assigned from prior knowledge or intuition and updated as evidence arrives

Concept #2 - Conditional Probability
Probability of an event occurring given that another event has already occurred

P(A/B) = P(A Intersection B) / P(B)

More Read - Link

Bayes Theorem
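Three production lines, each with its share of total output and its own defect rate
48% Red - 6% of the Red line's output is defective
31% Blue - 11% of the Blue line's output is defective
21% White - 8% of the White line's output is defective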

P(R/D) = P(R Intersection D) / P(D)

P(R) = .48
P(D/R) = 0.06
P(D) = P(D/R)P(R) + P(D/W)P(W) + P(D/B)P(B)

P(R/D) = P(R Intersection D) / P(D)

P(R/D) = P(R Intersection D) / (P(D/R)P(R) + P(D/W)P(W) + P(D/B)P(B))
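
Plugging the numbers in gives P(D) = 0.0797 and P(R/D) of roughly 0.36. A small Python sketch of the same calculation, assuming the percentages mean "share of total output" and "defect rate within that line":

```python
# Share of total output per production line, and defect rate within each line.
share = {"red": 0.48, "blue": 0.31, "white": 0.21}
defect_rate = {"red": 0.06, "blue": 0.11, "white": 0.08}

# Total probability: P(D) = sum over lines of P(D/line) * P(line)
p_defective = sum(defect_rate[line] * share[line] for line in share)

# Bayes theorem: P(line/D) = P(D/line) * P(line) / P(D)
posterior = {line: defect_rate[line] * share[line] / p_defective for line in share}

print(round(p_defective, 4))
for line, probability in posterior.items():
    print(line, round(probability, 3))
```

Red makes almost half of all items, yet a defective item is more likely to come from Blue, because Blue's defect rate is nearly twice as high.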

Concept #5
Independent Events - Independent Events are not affected by previous events

Dependent Events - Taking coloured marbles from a bag: as you take each marble there are fewer marbles left in the bag, so the probabilities change.

Independent Events - Taking and replacing coloured marbles from a bag: as you replace each marble the same marbles are left in the bag, so the probabilities won't change.

R Programming Concepts
  • R - Interpreted language
  • Assignment operator =, <-
  • Vector - Same as Arrays in other languages
  • B <- matrix(c(2,4,3,1,5,7), nrow = 3, ncol = 2) - matrix(data, rows, columns)
  • In Matrix - Data is filled column by column
  • Column-wise and row-wise operations are possible using the apply command: apply(B,2,mean) - column means, apply(B,1,mean) - row means
  • Data Frame - Way to store different types of columns, Values in one particular column need to be of same data type
Happy Learning!!!

August 19, 2015

Basics - Excellent Read for Database Enthusiasts

Basics - Excellent Read for Database Enthusiasts - How RDBMS works

Happy Learning!!!

August 17, 2015

Variance and Standard Deviation Example

Basic Example for Variance, Standard Deviation Computation

Happy Learning!!!

R Notes

Tip #1 
Declaration - matrix (0,3,4)
3 rows, 4 columns and values 0

Tip #2
a <- 1:12 - Vector (to fill Matrix)
matrix(a, 3, 4)

Tip #3

Assigning, Fetching & Printing values
plank <- 1:8          # Create Vector
dim(plank) <- c(2,4)  # Assign Dimension
print(plank)          # Print values
plank[1,4] <- 0       # Assign value
plank[2,]             # Fetch all 2nd row values

Tip #4 - Plotting Matrix

elevation <- matrix(2,5,10)
contour(elevation)  # 2D representation
persp(elevation)    # 3D representation
persp(elevation, expand = 0.2)

Tip #1 - Mean computation
a <- c(1,2,3,4,5)
mean(a)

Tip #2
Horizontal line across plots

Tip #3 
Standard Deviation
deviation <- sd(a)

Tip #1 
R has a special collection type called a factor, used for categorical values
chests <- c('gold', 'silver', 'gems', 'gold', 'gems')
types <- factor(chests)

Data Frame
Similar to Database
Tip #1 - Loading data from files

Tip #2 - Merge Data Frames
data1 <- read.csv("data.csv")
data2 <- read.table("a.txt",sep="\t")
merge(x = data1, y = data2)

Real World Examples
Data1 <- read.csv("a.csv")
Data2 <- read.table("b.txt", sep="\t", header=TRUE)
TargetData <- merge(x = Data1, y = Data2)
plot(TargetData$Data1, TargetData$Data2)

August 16, 2015

R Online Learning

I prefer to switch topics when I find it tricky to focus on one. I found the R language easy, simple and great to get started with. Code School has a beautiful self-learning portal that covers a cheat sheet and the fundamentals of working with R. Capturing some notes for my future reference.

Using R

Tip #1 - Assignment
x <- 42
y <- "Hello"

Tip #2 - Expression

Tip #3 - Arithmetic operations

Tip #4 - Functions


Tip #5 - File I/O


Tip #1 - Vector
List of Values - vector represented by c(2,3,4)
List of Strings - vector represented by c('a','b','c')
Mixed values - c(1,'a', TRUE) is coerced to a single type (here character), because a vector holds values of one data type

Tip #2 - Sequence Vectors
Representing sequence of numbers m to n by m:n
 - seq(10,50)
 - seq(10,50,5) - With increment step 5
 - seq(50,10) - Reverse sequence representation

Tip #3 - Assigning Vectors (Single Quotes)
sentence <- c('walk', 'the', 'plank')
a <- c (1,2,3)
b <- c (1,2,3)

Sample Vector Operations

x <- seq(1, 20, 0.1)
y <- sin(x)

Tip #4
c for combine vectors

Tip #5
Plotting Vectors
vectorCoordinates <- c(4, 5, 1)

x <- seq(1, 20, 0.1)
y <- sin(x)
plot(x, y)

Great Learning Sites
Code School

Happy Learning!!!

August 15, 2015

Recommendation Algorithm Analysis

Item to Item Rating based on customer’s purchase of products

The formula for comparison is the dot product divided by the product of the vector lengths (cosine similarity)
In the example for two sets Book and DVD
  • Book – (1,1,1) – Set A consider it as (A1, A2, A3)
  • DVD – (1,0,0) – Set B consider it as (B1, B2, B3)
Formula works as
  • (A1.B1 + A2.B2 + A3.B3) / (sqrt(A1² + A2² + A3²) · sqrt(B1² + B2² + B3²))
  • 1 / (sqrt(3) · sqrt(1))
  • 1 / 1.732
  • 0.577

Item to Item Comparison based on customer ratings

The formula for comparison is dot product divided by product of vector lengths
In the example for two sets Book and DVD
  • Book – (4,3,5) – Set A consider it as (A1, A2, A3)
  • DVD – (1,0,0) – Set B consider it as (B1, B2, B3)
Formula works as
  • (A1.B1 + A2.B2 + A3.B3) / (sqrt(A1² + A2² + A3²) · sqrt(B1² + B2² + B3²))
  • 4 / (sqrt(16+9+25) · sqrt(1))
  • 4 / 7.07
  • 0.566
Analysis - By comparing multiple items the items that yield the maximum value would be recommended to the customer
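
Both worked examples can be reproduced with a few lines of Python. In this sketch the Book and DVD vectors come from the notes above, while the extra "game" candidate is hypothetical and only there to show how the best match is picked:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = sqrt(sum(x * x for x in a))
    length_b = sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

dvd = (1, 0, 0)
purchase_score = cosine_similarity((1, 1, 1), dvd)  # purchase example
rating_score = cosine_similarity((4, 3, 5), dvd)    # ratings example
print(round(purchase_score, 3), round(rating_score, 3))

# Recommend the candidate whose vector is most similar to the DVD's.
# The "game" entry and its ratings are hypothetical.
candidates = {"book": (4, 3, 5), "game": (5, 0, 1)}
best = max(candidates, key=lambda name: cosine_similarity(candidates[name], dvd))
print(best)
```

Because the score only depends on the angle between the vectors, an item rated by the same customers in the same proportions scores high even if its absolute ratings differ.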

Happy Learning!!!