Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): 2015

December 31, 2015

R and Datascience

I found the site datascienceplus very interesting.

Using R, the author has categorized posts into

Data Loading
Data Management
Visualization
Stats

This really helps to align R learning accordingly. I am trying to repeat the pattern for my own R learning.

Happy Learning and Happy New Year 2016!!!

December 28, 2015

Token - Sequence of characters grouped together as a useful unit. Tokenization chops text into tokens, often throwing away certain characters such as punctuation
Type - Equivalence class of tokens
Term - Type that is included in the IR dictionary
Term Frequency (tf) - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df_t) - Number of documents in the collection in which term t appears
Inverse Document Frequency - IDF = log(N / df_t), where N is the number of documents in the collection and df_t is the number of documents containing term t
Stemming - Crude heuristics that chop the ends of words. Collapses derivationally related words. Stemming increases recall because morphological variations of a word are collapsed into a single token, giving higher chances of retrieval
Lemmatization - Return the base or dictionary form of a word. Collapses the different inflectional forms of a word.
Skip Pointers - For a postings list of length N, use Sqrt(N) evenly spaced skip pointers
Positional Index - Term: DocId <Pos1, Pos2>
Inverted Index - Dictionary mapping each term to the set of documents (for example file names) that contain it
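
The definitions above can be tried out in a few lines of Python. This is only a minimal sketch: the three-document collection, the whitespace tokenizer and the choice of base-10 logarithms are my assumptions for illustration, not something taken from the class notes.

```python
import math
from collections import defaultdict

# Hypothetical three-document collection, used only for illustration.
docs = {
    "d1": "new home sales top forecasts",
    "d2": "home sales rise in july",
    "d3": "increase in home sales in july",
}

def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

# Inverted index: term -> {doc_id: term frequency in that document}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    """Inverse document frequency: log10(N / df_t); 0 for unseen terms."""
    df = len(index.get(term, {}))
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id):
    """Log-weighted tf multiplied by idf for one term in one document."""
    tf = index.get(term, {}).get(doc_id, 0)
    return log_tf(tf) * idf(term)

print(sorted(index["july"]))   # documents containing "july"
print(idf("home"))             # term in every document, so idf is 0
print(tf_idf("in", "d3"))
```

A term that appears in every document ("home" here) gets an IDF of zero, which is the point of the weighting: it cannot help to tell documents apart.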

Boolean Retrieval (AND, OR, NOT)

Easy to Implement
Computationally efficient
Expressiveness and Clarity

Cons of Boolean Retrieval

No Ranking
No Weighting

Discounted Cumulative Gain (DCG)

Highly relevant docs are more useful when they appear earlier in search results list
Highly relevant docs are more useful than marginally relevant docs

DCG = Sum over rank positions i of (2^relevance_i - 1) / log2(i+1)
NDCG = DCG / IDCG
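
A small sketch of the two measures, assuming the common exponential gain (2^relevance - 1) with a log2(i+1) discount, and taking IDCG to be the DCG of the same judgments sorted from best to worst. The relevance grades in `ranking` are made up for illustration.

```python
import math

def dcg(relevances):
    """Sum over 1-based ranks i of (2**rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 3, 0, 1, 2]  # hypothetical graded relevance of the results, in rank order
print(dcg(ranking), ndcg(ranking))
```

An already ideal ordering scores an NDCG of exactly 1, and pushing a highly relevant document further down the list lowers the score.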

HITS - Hyperlink induced Topic Search

Authorities - Pages that directly answer the information need, for example the homepage microsoft.com
Hub - Pages with good links to the pages that answer the information need
Wikipedia is a good example of both a Hub and an Authority

Happy Learning!!!

December 24, 2015

T-Test

T-Test

- Developed in 1908 by William Gosset
- The t-test is also referred to as Student's t-test
- Mu, Sigma indicate population parameters (mean and standard deviation)
- X-bar, S represent the mean and standard deviation of the sample

Hypothesis Tests in R

One Sample T-Test

Function - t.test example in R
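
For comparison, the one-sample t statistic can be worked out by hand in Python. This is a rough sketch with a made-up sample and a hypothesised mean of 12. It only computes the statistic and degrees of freedom, not the p-value that R's t.test also reports, because the Python standard library has no t distribution.

```python
import math
import statistics

sample = [12, 15, 11, 14, 13]   # hypothetical measurements
mu0 = 12                        # hypothesised population mean (assumption)

n = len(sample)
x_bar = statistics.mean(sample)          # sample mean
s = statistics.stdev(sample)             # sample standard deviation (n - 1 denominator)
standard_error = s / math.sqrt(n)
t_statistic = (x_bar - mu0) / standard_error
degrees_of_freedom = n - 1

print(t_statistic, degrees_of_freedom)
```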

Happy Learning!!!

December 23, 2015

Hypothesis Testing Basics

After exams I understood my improvement areas in terms of learning. Predominantly these are crucial chapters

- P test using R Programming
- P test using Python Programming
- Hypothesis test using R Programming
- Hypothesis test using Python Programming

I glanced through a couple of sites and am bookmarking some of the pointers

Normal Distribution Properties

Key Pointers
- Normal distribution is unimodal and symmetric
- Mean (Mu)
- Standard Deviation (Sigma)
- 99.7% of values lie within 3 Sigma of the mean
- 95% of values lie within 2 Sigma of the mean
- |Z| > 2 (Unusual)
- pnorm (percentile of an observation)
- qnorm for quantile or cutoff values
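
The same pointers can be checked in Python. A minimal sketch, assuming Python 3.8 or later, where `statistics.NormalDist` offers `cdf` (comparable to R's pnorm) and `inv_cdf` (comparable to qnorm):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of values within 2 and 3 standard deviations of the mean
within_2_sigma = std_normal.cdf(2) - std_normal.cdf(-2)
within_3_sigma = std_normal.cdf(3) - std_normal.cdf(-3)

# Percentile of an observation with z = 2 (like pnorm(2))
percentile_of_z2 = std_normal.cdf(2)

# Cutoff value below which 97.5% of values fall (like qnorm(0.975))
cutoff_975 = std_normal.inv_cdf(0.975)

print(within_2_sigma, within_3_sigma, percentile_of_z2, cutoff_975)
```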

Key Pointers
- Creating Null and Alternate Hypothesis conditions
- Identifying sample size, standard error, population mean, standard deviation from the input question
- Computing P value

Happy Learning!!!

November 24, 2015

t-test and z-test

Problems to workout (Good Compiled List)

References
Link1
Link2

Z - Scores

Z-scores make it easy to compare scores from distributions that use different scales

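Formula #1 - Z = (X - Mean) / SD

Formula #2 - for a sample mean, Z = ((Sample Mean - Mean) * sqrt(n)) / SD

Formula #3 for raw score computation is defined by X = Mean + Z * SD

Formula #4 for Standard Error - SE = SD / sqrt(n)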

Trying out problems in link

Problem 2. Suppose X is a normal random variable with a mean of 120 and a standard deviation of 20. Determine the probability that X is greater than 135.

Mean = 120
SD = 20
Z score = (135-120)/20 = 0.75

z score from attached link

Find P(Z < 0.75) = 0.7734
P(X > 135) = 1 - 0.7734 = 0.2266

Problem 4. If the test scores of 400 students are normally distributed with a mean of 100 and a standard deviation of 10, approximately how many students scored between 90 and 110?
Mean = 100
SD = 10

For x = 90, z = (90-100)/10 = -1
For x = 110, z = (110-100)/10 = 1
P(Z < -1) = 0.1587
P(Z < 1) = 0.8413
P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826

Multiply this proportion by 400: 0.6826 * 400 = 273.04. After rounding, we get 273 students.

Problem 16. A traffic study shows that the average number of occupants in a car is 1.5 and the standard deviation is .35. In a sample of 45 cars, find the probability that the mean number of occupants is greater than 1.6.

Mean = 1.5
SD = .35

Applying Formula #2

P(mean > 1.6) = 1 - P(mean < 1.6)
Z = ((1.6-1.5)*sqrt(45)) / 0.35
= 1.916

P(Z < 1.91) = 0.9719 (from the z table)
P(mean > 1.6) = 1 - 0.9719 = 0.0281
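
The three problems can be re-checked without a z table. A sketch assuming Python 3.8+ and its `statistics.NormalDist` class. Because it evaluates the normal CDF directly instead of rounding z to two decimals for a table lookup, the last answer comes out a little smaller than the table-based 0.0281.

```python
import math
from statistics import NormalDist

# Problem 2: P(X > 135) for mean 120, SD 20
p2 = 1 - NormalDist(mu=120, sigma=20).cdf(135)

# Problem 4: expected number of the 400 students scoring between 90 and 110
scores = NormalDist(mu=100, sigma=10)
students = 400 * (scores.cdf(110) - scores.cdf(90))

# Problem 16: P(sample mean > 1.6) for n = 45, mean 1.5, SD 0.35
standard_error = 0.35 / math.sqrt(45)
z16 = (1.6 - 1.5) / standard_error
p16 = 1 - NormalDist().cdf(z16)

print(round(p2, 4), round(students), round(z16, 3), round(p16, 4))
```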

Happy Learning!!!

November 21, 2015

chi-square test for homogeneity

The chi-square test for homogeneity is used to determine whether several populations are similar (homogeneous) with respect to some characteristic.

This link was useful

I tried the problem provided in the link

Problem - Know how to compute the chi-square homogeneity test statistic.

Step 1

Step 2

Step 3

1-pchisq(19,df=2) - R Command

7.485183e-05

Since the p-value is less than 0.05, you reject the null hypothesis

Happy Learning!!!

Chi Square Test for Independence

Uses a cross-classification table to examine the nature of the relationship between two variables
These tables are sometimes referred to as contingency tables
Determines whether the variables are dependent on each other or not

Approach

H0: chi square test for independence is conducted by assuming that there is no relationship between the two variables
Ha: alternative hypothesis is that there is some relationship between the variables

The general formula for the degrees of freedom is the number of rows minus one, times the number of columns minus 1.

In terms of independence and dependence these hypotheses could be stated

H0 : X and Y are independent
H1 : X and Y are dependent

Expected Frequency = ((row total)*(column total))/Total Population

I liked the example provided in link

Problem - Test for a Relationship between Sex and Class

Y (Social Class)      X (Sex)
                      Male (M)   Female (F)   Total
Upper Middle (A)          33         29          62
Middle (B)               153        181         334
Working (C)              103         81         184
Lower (D)                 16         14          30
Total                    305        305         610

Table 10.12: Social Class Cross Classified by Sex of Respondents

Expected Frequency = ((row total)*(column total))/Total Population
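
The expected frequency formula applied to every cell of the table, as a small Python sketch. The dictionary layout and variable names are my own. It only covers the expected counts step.

```python
# Observed counts from the Sex by Social Class table: (Male, Female)
observed = {
    "Upper Middle (A)": (33, 29),
    "Middle (B)": (153, 181),
    "Working (C)": (103, 81),
    "Lower (D)": (16, 14),
}

column_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(column_totals)

# Expected frequency = (row total * column total) / total population
expected = {
    label: [sum(row) * column_total / grand_total for column_total in column_totals]
    for label, row in observed.items()
}

for label, values in expected.items():
    print(label, values)
```

Since both column totals are 305, each expected count is simply half of its row total.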

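Chi-square statistic = Sum of (Observed - Expected)^2 / Expected over all cells = 5.3691

1-pchisq(5.3691,df=3)

Approximately 0.1467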

Since the p-value is greater than or equal to 0.05, you don't reject the null hypothesis

Results match with the problem although the approach is different. The grand total is 610.

Happy Learning!!!

Stats - Chi-Square Goodness of Fit Test

Purpose - Test whether observed frequencies follow a specified (expected) distribution

The chi-square test is defined for the hypothesis:
H0: The data follow a specified distribution
Ha: The data do not follow the specified distribution
This means that if the significance value is less than 0.05, you reject the null hypothesis; if significance is greater than or equal to 0.05, you don't reject the null hypothesis

Formula is

I liked the example mentioned in notes

Problem - Testing an octahedral die to see if it is biased

Score 1 2 3 4 5 6 7 8

Frequency 7 10 11 9 12 10 14 7 (Observed)

Degrees of Freedom = Number of entries - 1. Here it is 8-1 = 7

Test the hypothesis H0 - The Die is Fair

H1: Die is not fair

Significance level alpha = 0.05

Expected frequency under a uniform distribution is Ei = Sum of all observed frequencies / 8 (Number of faces)

= 80/8 = 10

The expected values will be

Score 1 2 3 4 5 6 7 8

Frequency 10 10 10 10 10 10 10 10 (Expected)

To compute the statistic we need to find the values of (Oi-Ei) and ((Oi-Ei)*(Oi-Ei))/Ei

for each pair of elements in the two arrays, and sum the second quantity. Here the sum is 40/10 = 4

Compute the p-value for this chi-square value (R Command)

1-pchisq(4,df=7)

0.7797774

This is above the significance level of 0.05, so we cannot reject the null hypothesis

Answer - The Die is Fair
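
The whole die example can also be reproduced in Python. This is a sketch, not a library routine: `chi2_sf` is a helper I wrote for illustration, which evaluates the upper tail of the chi-square distribution (the same quantity as 1-pchisq in R) through the power series of the regularized lower incomplete gamma function. For real work a tested statistics library would be the safer choice.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # Series: P(a, t) = t^a e^-t / Gamma(a + 1) * sum_n t^n / ((a+1)...(a+n))
    term, total, n = 1.0, 1.0, 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1)) * total
    return 1.0 - lower

observed = [7, 10, 11, 9, 12, 10, 14, 7]
expected_count = sum(observed) / len(observed)   # 80 / 8 = 10 under a fair die

statistic = sum((o - expected_count) ** 2 / expected_count for o in observed)
degrees_of_freedom = len(observed) - 1
p_value = chi2_sf(statistic, degrees_of_freedom)

print(statistic, degrees_of_freedom, p_value)
```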

Happy Learning!!!

Good Read on Taylor Series

Two summary points

A Taylor Series is an expansion of a function into an infinite sum of terms
A derivative gives you the slope of a function at any point

Detailed Notes in link
Taylor series Formula Compilation - link

Happy Learning!!!

November 19, 2015

Good Formula Lists Compilations

Happy Learning!!!

November 08, 2015

K Means Clustering

I'm slowly moving in Stats with a lot of learning. This post is from my class notes

K-means clustering

Finds groups of objects that are similar to one another
Partitioning clustering approach
The means move on every iteration (it usually converges within the first few iterations)
Classifies a given data set into a chosen number of clusters (K)
Does not fit well when clusters differ widely in density (sparse vs dense clusters)

Great 5 Minute Video

Step 1 - "Figure out centric of region"
Step 2 - "Select K Data points randomly"
Step 3 - "Assign each data point to nearest centre"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Step 3,4"
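
The steps above can be sketched in plain Python. This is my own minimal version, assuming squared Euclidean distance, a fixed K and a seeded random start. The six sample points are made up.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means: returns (centroids, clusters) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # select K data points randomly
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign each data point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recalculate the centroids (an empty cluster keeps its old centroid)
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:         # nothing moved, so it has converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well separated groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```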

More Reads - K-Means Clustering

DTW - Dynamic Time Warping algorithm. DTW allows similar shapes to match even if they are out of phase on the time axis
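
A compact sketch of the classic dynamic-programming form of DTW, assuming one-dimensional series and absolute difference as the local cost (both are my choices for illustration):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# The same shape, with one value held for an extra time step
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The two series in the usage line differ only by a repeated value, so the warping absorbs the difference and the distance is zero.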

Ref - Link

Happy Learning!!!

November 02, 2015

Quick Tip - Python Stemming Module Installation - Windows

Copy the scripts to the package folder. Run easy_install.py, specifying the package that contains the scripts.

Happy Learning!!!

October 31, 2015

5 minute quick learning - Naïve Bayes

Good 5 Minute Learning!!!

Conditional Probability - Easy Walkthrough

I take several iterations to understand and try out a concept. Going back and learning it again after some time is interesting. This post is a quick explanation of Conditional Probability.

P(A) - Probability of Event A to occur
P(A/B) - Probability of A given that B has already occurred. - This we refer as conditional probability

Problem - Roll a fair die. Let A be the event of an odd outcome and B the event that the outcome is <= 3. What is the probability of A, and the probability of A given that B has already occurred?

A = Odd Outcomes = {1,3,5}, |A| = 3
B = Outcome <=3 = {1,2,3}, |B| = 3
Sample space S = {1,2,3,4,5,6}, |S| = 6

Probability P(A) = |A| / |S| = 3/6 = 1/2

Probability P(A/B) = Probability of A given that B has already occurred

From B outcomes {1,2,3}, Possible A values are = {1,3}

p(A/B) = Events{1,3} / Events of B{1,2,3}
p(A/B) = 2/3

Bayes Theorem

P(CD) = P(C/D)P(D)

P(CD) = P(D/C)P(C)

Equating both the formulas

P(D/C)P(C) = P(C/D)P(D)

P(D/C) = (P(C/D)P(D)) / P(C)

P(C/D) = (P(D/C)P(C)) / P(D)
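
Both the die example and Bayes theorem can be verified by simple enumeration. A small sketch using exact fractions. The helper names `prob` and `cond_prob` are mine.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                 # fair die
A = {x for x in sample_space if x % 2 == 1}       # odd outcomes
B = {x for x in sample_space if x <= 3}           # outcome <= 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(event, given):
    """P(event / given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

print(prob(A))            # 1/2
print(cond_prob(A, B))    # 2/3

# Bayes theorem: P(A/B) = P(B/A) P(A) / P(B)
print(cond_prob(B, A) * prob(A) / prob(B))   # 2/3
```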

Happy 2 Minute Quick Learning!!!

September 22, 2015

R Working on Normal Distributions & Binomial Distributions

Cumulative distribution function - The CDF gives the probability that the variable is less than or equal to a given value, i.e. it accumulates all the probability up to that value. The CDF is the integral of the PDF (Probability density function).

Read Jack Lion Heart's answer to What is the difference between a probability density function and a cumulative distribution function? on Quora

“d” - returns the height of the probability density function
“p” - returns the cumulative distribution function
“q” - returns the inverse cumulative distribution function (quantiles)
“r” - returns randomly generated numbers

Reference - Link

R Examples

pnorm

pnorm(700,500,100)

Mean - 500

Standard Deviation - 100

This score of 700 is better than about 97.7% of other scores

v <- c(-1,1,2)

dnorm(v)

plot(v,dnorm(v))

plot(v,pnorm(v))

dnorm(0,mean(v),sd(v))

Link Ref

Normal distribution is defined by the following probability density function, where μ is the population mean and σ^2 is the variance: f(x) = (1 / (σ * sqrt(2π))) * exp(-(x - μ)^2 / (2σ^2))

Chi-squared Distribution - If X1,X2,…,Xm are m independent random variables having the standard normal distribution, then the quantity V = X1^2 + X2^2 + … + Xm^2 follows a Chi-Squared distribution with m degrees of freedom

Binomial Distribution - The binomial distribution is a discrete probability distribution. It describes the number of successes in n independent trials of an experiment.

Two Parameters - number of trials and the probability of success for a single trial

Probability mass function

x <- seq(0,50,by=1)

y <- dbinom(x,50,0.2)

plot(x,y)

50 - Number of Trials

0.2 - Probability of success for each trial

Cumulative Distribution Function

x <- seq(0,50,by=1)

y <- pbinom(x,50,0.5)

plot(x,y)

50 - Number of Trials

0.5 - Probability of success for each trial
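
For comparison, the binomial probability and cumulative functions can be written directly from the formula in Python. A sketch assuming Python 3.8+ for `math.comb`. The helper names are mine and play the roles of R's dbinom and pbinom.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for n trials with success probability p (like dbinom)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k), the running sum of the pmf (like pbinom)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Same parameters as the R examples: 50 trials
pmf_values = [binom_pmf(k, 50, 0.2) for k in range(51)]
cdf_values = [binom_cdf(k, 50, 0.5) for k in range(51)]

most_likely = max(range(51), key=lambda k: pmf_values[k])
print(most_likely, cdf_values[25])
```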

Random Samples from the Binomial Distribution

x <- seq(0,50,by=1)

y <- rbinom(length(x),50,0.5)

plot(x,y)

Happy Learning, More Learning Needed...It's vast....Lot more efforts needed :)

September 18, 2015

Maths - Basics

Good Read
Link1

Happy Learning!!!

Class Notes - Information Retrieval

Information Retrieval Notes

Document Corpus (Collection of links / indexed web structures)

Examples of Information Retrieval

When a user hums a song and we identify the song from the humming, it is classified as an IR problem
Multimedia IR (music, video, analysing music videos)
Photo Search (Visual IR)

Information Retrieval Application Areas

Text Information Retrieval
Web Search
Social Media Search
Micro Blogs
Twitter Blogs

Boolean Information Retrieval

Simplest model
Restricted Queries
queries are boolean expressions

Inverted Index

For each term we have a list of documents (postings list)
Like the index of a book (Topic, Page number), close to a glossary
Tokenize the text of each document
Inside a document, order the tokens; heuristics combine multiple tokens for index construction
Document Frequency - Number of documents in which the term appears

Challenges

Ordering (Right to Left)
Proximity Search (Leveraging the context)
Encoding
Normalization and keyword detection based on locale
Accents, patterns
Stemming (Chopping end of words to obtain root word)
Porter algorithm for stemming
Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
Skip pointers (When finding elements common to both lists, skip ahead in a list until the values match)
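
How skip pointers speed up a Boolean AND can be sketched as below. This is an illustrative version under my own assumptions: postings are sorted lists of unique document ids, and a skip pointer is simulated at every sqrt(N)-th position instead of being stored in the list. The postings in the usage lines are made up.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, jumping ahead via sqrt(N) skip pointers."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only from a skip position and only if it does not overshoot
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result

postings_a = [2, 4, 8, 16, 19, 23, 28, 43]
postings_b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
print(intersect_with_skips(postings_a, postings_b))
```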

Happy Learning!!!

September 13, 2015

Central Limit Theorem

Normal Distribution

Standard Normal Distribution - Mean = 0, Variance = 1

The distribution of a sum or mean approaches the normal distribution as the number of variables grows.

"As n increases, the distribution of sample mean approaches normal distribution"

Central Limit Theorem - The sum or mean of many independent random variables is approximately normally distributed, which is why many measurable "random" quantities in the real world look roughly normal.
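
A quick simulation makes this visible. The sketch below is my own set-up: it draws repeated samples from a clearly non-normal population (exponential with mean 1 and standard deviation 1), with a fixed seed, sample size 50 and 2000 repetitions chosen arbitrarily.

```python
import random
import statistics

rng = random.Random(42)        # fixed seed so the run is repeatable
sample_size = 50
repetitions = 2000

# Mean of each sample drawn from an exponential population
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)       # close to the population mean 1
spread_of_means = statistics.stdev(sample_means)     # close to 1 / sqrt(50)
standard_error = 1.0 / sample_size ** 0.5

# For a normal distribution roughly 95% of values lie within 2 standard errors
share_within_2_se = sum(
    abs(m - 1.0) <= 2 * standard_error for m in sample_means
) / repetitions

print(mean_of_means, spread_of_means, share_within_2_se)
```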

Good Link

"Sampling distribution of the sampling means approaches a normal distribution as the sample size gets larger"

"Average of your sample means will be the population mean"

Happy Learning!!!

September 09, 2015

Linear Algebra Playlist

Linear Algebra and Calculus basics Playlist bookmarked based on reference from my colleague

Pauls Online Calculus Notes

MIT Linear Algebra Playlist

Linear Algebra

Calculus

More Reads
Link1
Link2
LA in 4 Pages
LA Machine Learning
Link3
Link4
Link5

Happy Learning!!!

Linear Regression Basics

To understand Linear Regression, the basics of slope and correlation were useful.

Slope Revision

Finding the slope of a line from its graph: Slope of a line

Slope = Change in Y / Change in X
Slope is constant for a line

Simple Linear Regression

One Explanatory Variable - Simple Regression
More than one Explanatory Variable - Multiple Regression
X - Independent Variable (Explanatory), Y - Dependent Variable (Response)
Good fitting line (Reasonable for predicting the relationship) - Measure of Strength of Relationship (Correlation)
Correlation - A & B are observed at the same time
Method of Least Squares to estimate B0 (intercept) and B1 (slope)
Residual = Observed - Predicted value (Above Line +ve, Below Line -ve)
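
The least squares estimates can be computed by hand in a few lines. A minimal sketch with made-up data points, using the usual closed form B1 = Sxy / Sxx and B0 = mean(Y) - B1 * mean(X):

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = mean_y - b1 * mean_x      # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]               # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]               # hypothetical responses

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]   # observed - predicted
print(b0, b1, residuals)
```

The residuals of a least squares line with an intercept always sum to zero, which is a handy sanity check.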

Reference Videos

Squared error of regression line: Introduction to the idea that one can find a line that minimizes the squared distances to the points

Happy Learning!!!

September 07, 2015

Covariance and Correlation - Random Variable - Probability

This video was useful to understand the covariance and correlation relationship

Happy Learning!!!

Working with R - InterQuartile Range

Concept - IQR - InterQuartile Range

IQR = Q3 - Q1 = 3rd Quartile - 1st Quartile

Median - Arrange data from lowest to highest
On an even-sized dataset - Average of the two middle numbers
On an odd-sized dataset - Single number that is halfway into the set

Dataset - 5,6,12,13,15,18,22,50

Q2 = (13+15)/2 = 14 - Median of Data Value

Q1 = (6+12)/2 = 9 - Median Before Q2

Q3 = (18+22)/2 = 20 - Median After Q2

IQR = Q3-Q1 = 20-9 = 11

BoxPlot is used to identify outliers

For Above Dataset

Minimum Value - 5
Q1 - 9
Q2 - 14
Q3 - 20
Maximum Value - 50

This is the mathematical concept. This is used for finding outliers.

Outlier - Much larger or smaller than the other values in the data set. IQR is obtained by subtracting the first quartile from the third quartile.

Finding Outliers

1. Any value < Q1-1.5(IQR) or > Q3+1.5(IQR) is an outlier

2. Any Value < 9-1.5(11) = 9-16.5 = -7.5

3. Any Value > 20+1.5(11) = 20+16.5 = 36.5, so 50 is an outlier
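
The hand calculation can be mirrored in Python. This sketch follows the same "median of each half" method used above. The function name `quartiles` is mine.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd-sized dataset the middle value belongs to neither half
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

dataset = [5, 6, 12, 13, 15, 18, 22, 50]
q1, q2, q3 = quartiles(dataset)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in dataset if v < lower_fence or v > upper_fence]

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers)
```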

This Video was useful to understand the concept before trying out in R

Computing using R

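IQR is the range between the 25th percentile and the 75th. R's default quantile method interpolates, so for this dataset it reports Q1 = 10.5, Q3 = 19 and IQR = 8.5 rather than the hand-computed 9, 20 and 11.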

dataset <-c( 5,6,12,13,15,18,22,50 )
quantile(x=dataset, probs= c(.25,.75))
IQR(x=dataset)
boxplot(dataset)

Sample Output

Outlier highlighted in circles

Happy Learning!!!

September 06, 2015

Class 3 - Statistics Notes

This was mostly on probability distribution functions. Couple of one liners from session

Conditional Probability Distribution - If the value of one random variable does not affect the distribution of another, the random variables are independent

Variance - Measure of how far the values spread around the mean (average squared deviation from the mean)

Covariance - If X and Y are two independent variables, Covariance is zero
Correlation - Values between -1 and 1

Probability "Chebyshev Inequality"
Chebyshev Inequality - Bounds the probability in the tails for any random variable with finite variance: P(|X - Mu| >= k Sigma) <= 1/k^2

Related Reads

Happy Learning!!!

September 03, 2015

R Basic Examples

Listed below are a couple of basic examples worked from the R Console

Example 1 - Set and get working directory

Example 2 - Read from Data Files

Example 3 - Count row and columns in data

Example 4 - Functions

Example 5 - Plotting

Happy Learning!!!

August 30, 2015

Video Analytics Class 2

My Notes

Linear Filter - Linear combination of neighbours
Box filter - All values constant [1's]
Correlation - Mask is moved across the image
Gradient - Caused by surface normal discontinuity, depth discontinuity, illumination discontinuity
LoG - Laplacian of Gaussian. LoG is capable of finding edges
Salt and Pepper Image - Image has random black and white pixels (noise)

Basics

Represent Image as a Matrix
Represent Image as a function
Point, local operations, histogram equalization, moving average model
Cross Correlation g = H X F
Gaussian filter (Removes high frequencies, blurs and smooths the image)
Symmetric Matrix (When you swap rows and columns it appears the same: aij = aji for all indices i and j) example link

Convolution Basics

Programmatic Walkthru - link

FFT

Convolution Applications

Smoother image
Gaussian (Point spread function)
Different Kinds of filter (Box, Gaussian filter)

Cross Correlation - Assess how similar are two different functions. Compares position by position.
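
The difference between cross-correlation and convolution is easiest to see in one dimension. A sketch under my own simplifications: plain Python lists stand in for an image row, and only the "valid" positions, where the kernel fits entirely inside the signal, are computed.

```python
def cross_correlate(signal, kernel):
    """Slide the kernel over the signal, position by position, without flipping it."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def convolve(signal, kernel):
    """Convolution is cross-correlation with the kernel flipped."""
    return cross_correlate(signal, kernel[::-1])

# Box filter: all weights equal, so it smooths a single spike
box = [1 / 3, 1 / 3, 1 / 3]
print(cross_correlate([0, 0, 9, 0, 0], box))

# An asymmetric kernel shows the difference between the two operations
edge = [1, 0, -1]
print(cross_correlate([1, 2, 4, 7], edge))   # [-3, -5]
print(convolve([1, 2, 4, 7], edge))          # [3, 5]
```

For symmetric kernels such as the box or Gaussian filter the two operations give the same result, which is why the terms are often used loosely.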

Correlation Walkthru

From Link

Mathematics concepts to learn

Vector Product
Eigen Value decomposition
First Derivative, Second Derivative

Vertical and horizontal edge detection filters - Sobel, Roberts, Prewitt (Vertical, Horizontal, Diagonal edge detection filters).

Good Read Link
MIT Course Slides link

Related Reads

Happy Learning!!!

August 25, 2015

OpenCV Python Basics

Basic image loading modules

Example #1

import cv2
import numpy as np
from matplotlib import pyplot as plt
# Load image (raw string so the backslashes in the Windows path are not read as escapes)
source = cv2.imread(r'D:\images\Benz.png')

Ref - Link

Example #2

# Printing width and height of image
import cv2
import numpy as np
from matplotlib import pyplot as plt

# Load image (imread returns None if the file cannot be read)
source = cv2.imread(r'D:\images\Benz.png')
print(source.shape)

# Output number of rows / columns
rowCount = source.shape[0]
columnCount = source.shape[1]
print("%d %d" % (rowCount, columnCount))

Ref - Link

Example #3
#Drawing Histogram

import cv2
import numpy as np
from matplotlib import pyplot as plt

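# Load image (raw string so the backslashes in the Windows path are not read as escapes)
source = cv2.imread(r'D:\images\Benz.png')

# Arguments: images, channels, mask, histSize, ranges
# Channels - grayscale image - [0], colour image - 0, 1, 2 (OpenCV loads colour images in B, G, R order)
# Mask - Supplied None as the full region is needed
# histSize - Number of bins
# ranges - 0 to 256
hist = cv2.calcHist([source], [0], None, [256], [0,256])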
plt.plot(hist)
plt.show()

Happy Learning!!!

August 24, 2015

SettingUp OpenCV and Python

Reference Steps - Link

1. Download and Install python from link. Install with Default Settings
2. Download and Install Matplotlib (the link in the reference steps is fine)
3. Download and Install OpenCV executable. Extract it to C:\OpenCV location

4. Now All Installations are located in C:\

5. Open Python IDLE from program files. In win7 run-it-as admin to open Program Files ->Python IDLE

6. Copy File from below location

From - C:\OpenCV\opencv\build\python\2.7\x86\cv2.pyd

To - C:\Python27\Lib\site-packages

7. Got the error described in Link

8. Download and install numpy from link (the numpy 1.7 link provided is incorrect; you need 1.9.2)

9. Validating installation steps

Happy Learning!!!
