"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 11, 2016

Data Models

Captured Notes from Session #2 - Data Models

Hierarchical Data Models
  • Tree like structures
  • Used in Windows Registry
  • Frequently used in IBM's IMS (Information Management System)
  • DL/I (Data Language/One) is the data access language for IMS
  • Difficult to reorganize
Graph / Network Model
  • Organize collection of records in form of directed graph
  • 3 way relationships can't be maintained
ER Model
  • Defined in terms of Entity, Relationships
  • Never caught on as a physical model
Object Oriented Database model
  • Difficult mapping programming objects to database objects
Relational Model
  • Better physical data independence
  • Better logical independence
  • Won because of relational algebra
Happy Learning!!!

February 01, 2016

World of Data Science

My second-semester classes have started. The first session was very interesting and a great introduction to the world of data science. I have read and re-read the same kinds of definitions and introductory articles on data science, but Prof. Manish Singh's session offered a whole new analogy and interesting examples to relate to.

For big data I have always referred back to the 4 Vs: Volume, Veracity, Velocity and Variety. Along the same lines, the definition was presented as
  • Internet of Content - YouTube, ebooks, Wikipedia, news feeds
  • Internet of People - Email, Facebook, LinkedIn etc.
  • Internet of Things - Devices with unique IDs communicating with / managing infrastructure
  • Internet of Location - Spatial data related analysis
This "Internet of *" is a good representation of the different forms / flows of information behind the four Vs

Big Data = Crude Oil

"Big data is about extracting the ‘crude oil’, transporting it in ‘mega-tankers’, siphoning it through ‘pipelines’ and storing it in massive ‘silos’"

Data Science – Data science is an interdisciplinary field for extracting knowledge from data.

The Data Science workflow involves Data Visualization, Data Analysis, Data Processing and Data Storage tasks. Some of the tools used in each layer are listed below.


Tools available

Data Visualization
Ambrose, Tableau, GWT, D3 / Infovis, R/Python, Gephi, Chaco (Graph partitioning tool)

Data Analysis
Mahout, Piggybank, Hive, Pegasus, Giraph, Pig, AllReduce, MR (MapReduce)

Data Processing

Scheduler – Azkaban, Oozie, Ivory
Cluster Monitoring – (Ganglia + Nagios), Chukwa, Zookeeper

Data Storage
HDFS, HSFTP (HDFS over HTTP), S3, KFS (Kosmos File System)
Data Movement – Sqoop, Flume, Scribe, Kafka, MessageQueue
Columnar Storage – Zebra
Key Value – HBase

The key ingredients of Data Science are
  • Data Management System
  • Data Mining
      - Computational process to identify patterns in large data sets
      - Uses techniques at the intersection of multiple disciplines (AI, Stats, Machine Learning, Computer Networks)
      - Data classification, clustering, regression, association rule finding and anomaly detection
  • Process Mining
      - Aims to discover, monitor and improve real-time processes (e.g. logs, events, alerts, rules)
  • Information Visualization
      - Visualization techniques for large data sets, interactive information visualization, how to really visualize big data


Databases Vs Data Science
Aspect | Databases | Data Science
Data Value | Precious | Cheap
Data Volume | Modest | Massive
Structure | Strongly structured (schema) | Weakly structured or none (text)
Priorities | Consistency, Error Recovery, Auditability | Speed, Availability, Query richness
Base | Relational algebra | Linear algebra

PS: My professor had provided references to the examples; I am sharing this post based on notes / slides from my session.  

Happy Learning!!!

January 30, 2016

NOSQL Basics

Getting ready for the second semester with a quick look at NoSQL basics. Two short reads on NoSQL

Difference between RDBMS & NOSQL

RDBMS - NOSQL
Scale up - Scale out
Structured data - Semi-structured / unstructured data
Atomic transactions - Eventual consistency
The two also store data on disk in different structures

Atomic Transactions vs Eventual Consistency
Atomic - ATM transactions (either all changes are made or none are made)
Eventual Consistency - There is no guarantee that all updates are done at a given point in time; they will be completed at some point (eventually)




Happy Learning!!!

January 18, 2016

Type I and Type II Error


Type I error, also known as a "false positive": the error of rejecting the null hypothesis even though it is true
Type II error, also known as a "false negative": the error of not rejecting a null hypothesis when the alternative hypothesis is the true state of nature

I liked the comment below from Khan Academy
The easiest way to think about Type 1 and Type 2 errors is in relation to medical tests. A type 1 error is where the person doesn't have the disease, but the test says they do (false positive). A type 2 error is where the person has the disease but the test doesn't pick it up (false negative).
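To put numbers on the two error types, here is a minimal R sketch for a one-sided z test. All the values (mu0, mu1, sigma, n) are made-up assumptions for illustration and are not from the Khan Academy comment.

```r
# One-sided z test of H0: mu = mu0 against Ha: mu > mu0.
# All numbers below are illustrative assumptions.
mu0   <- 100   # mean under the null hypothesis
mu1   <- 105   # assumed true mean when the alternative holds
sigma <- 15    # assumed known population standard deviation
n     <- 36    # sample size
alpha <- 0.05  # significance level

se     <- sigma / sqrt(n)                         # standard error of the sample mean
cutoff <- qnorm(1 - alpha, mean = mu0, sd = se)   # reject H0 when the sample mean exceeds this

# Type I error: H0 is true but the sample mean lands above the cutoff
type1 <- pnorm(cutoff, mean = mu0, sd = se, lower.tail = FALSE)

# Type II error: the true mean is mu1 but the sample mean stays below the cutoff
type2 <- pnorm(cutoff, mean = mu1, sd = se)

power <- 1 - type2
print(c(type1 = type1, type2 = type2, power = power))
```

By construction the Type I error rate equals alpha; the Type II error rate depends on how far the assumed true mean is from the null value.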

Happy Learning!!

January 12, 2016

Loadrunner script generation from SOAP Web Service Request


I have been learning LoadRunner basics for my work. I found this site extremely handy and useful

Loadrunner XML Tools

Basically, this utility translates your SOAP request into a LoadRunner request. The result is handy to customize, parameterize and take further.

Happy Learning!!!

January 07, 2016

R and Hypothesis Tests

It took a couple of months to fully analyse and absorb what I learned about hypothesis testing
  • Formulating and identifying the null hypothesis and the alternate hypothesis
  • Working with the normal distribution (left-tailed, right-tailed and two-tailed tests)
  • Identifying the area under the curve (using pnorm in R)
  • Computing the t value or z value
  • Computing the p value
  • If p value < 0.05 then reject the null hypothesis
  • If p value >= 0.05 then do not reject the null hypothesis
Finding P-Values: here we use the pnorm function.
Usage: P-value = pnorm(z, lower.tail = ), where z is the z score of the sample mean.
  • Left-Tailed Tests: P-value = pnorm(z, lower.tail=TRUE)
  • Right-Tailed Tests: P-value = pnorm(z, lower.tail=FALSE)
  • Two-Tailed Tests: P-value = 2 * pnorm(abs(z), lower.tail=FALSE)
The two problems below apply this logic


Problem #1 - P-Value Test Case
A rental car company claims the mean time to rent a car on their website is 60 seconds with a standard deviation of 30 seconds. A random sample of 36 customers attempted to rent a car on the website. The mean time to rent was 75 seconds. Is this enough evidence to contradict the company's claim? What is the p-value?

H0: mean time = 60 seconds
Ha: mean time ≠ 60 seconds

Population Mean = 60
Population SD = 30
Sample Mean = 75
Sample Count = 36

Standard Error = Population SD / sqrt(Sample Count)
Standard Error = 30 / sqrt(36)
Standard Error = 30 / 6 = 5

Z score = (Sample Mean - Population Mean) / Standard Error
Z score = (75 - 60) / 5 = 3

Two-tailed test since the alternative hypothesis uses the ≠ symbol
2*pnorm(75, mean=60, sd=5, lower.tail=FALSE)
p value = 0.002699796

Since the p value is less than 0.05, you reject the null hypothesis
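As a cross-check on Problem #1, the sketch below computes the two-tailed p value in two equivalent ways: from the z score, and directly from the sample mean with the standard error as sd. The variable names are my own.

```r
# Problem #1: two-tailed z test with a known population standard deviation
pop_mean    <- 60
pop_sd      <- 30
n           <- 36
sample_mean <- 75

se <- pop_sd / sqrt(n)               # standard error of the mean
z  <- (sample_mean - pop_mean) / se  # z score of the sample mean

# Route 1: standard normal distribution and the z score
p_from_z <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Route 2: normal distribution of the sample mean itself
p_direct <- 2 * pnorm(sample_mean, mean = pop_mean, sd = se, lower.tail = FALSE)

print(c(se = se, z = z, p_from_z = p_from_z, p_direct = p_direct))
```

Both routes give the same p value, which is why either form of the pnorm call can be used.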

Problem #2
An outbreak of Salmonella related illness was attributed to ice cream produced at a certain factory. Scientists measured the level of Salmonella in 9 randomly sampled batches of ice cream. The levels (in MPN/g) were: 0.593 0.142 0.329 0.691 0.231 0.793 0.519 0.392 0.418. Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g? What is the p-value

H0: mean = 0.3 MPN/g
Ha: mean > 0.3 MPN/g

Option #1
Using R t-test
x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
t.test(x, alternative="greater", mu=0.3)

p-value = 0.02927. Since the p value < 0.05, we can reject the null hypothesis

Option #2
Computing the t statistic by hand: 9 random samples, so degrees of freedom = 9 - 1 = 8

collectedsample = c(0.593,0.142,0.329,0.691,0.231,0.793,0.519,0.392,0.418)
samplemean = mean(collectedsample)          # 0.4564444
standarddeviation = sd(collectedsample)     # 0.2128439
populationmean = 0.3
n = length(collectedsample)
sdx = standarddeviation/sqrt(n)             # standard error of the mean
tvalue = (samplemean - populationmean)/sdx
tvalue                                      # 2.205058
df = n - 1

# Right-tailed test: area of the t distribution above tvalue
pvalue = pt(tvalue, df=df, lower.tail=FALSE)

pvalue = 0.0292652

Since the sample size is < 30 and the population standard deviation is unknown, we use the t distribution (pt) here instead of pnorm

Happy Learning!!!

January 02, 2016

R + Stats

The following course material is very useful for combining R and statistics. It's great material for learning R. Captured below are notes from chapters 5, 6, 7 and 8

Binomial Probability - Only two mutually exclusive outcomes, often referred to as success and failure. Each such trial is also called a Bernoulli trial
R commands - The dbinom and pbinom functions
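A small illustration of the two functions, assuming a fair coin tossed 10 times (the coin example is my own, not from the course material): dbinom gives the probability of exactly k successes and pbinom the cumulative probability of at most k.

```r
# 10 tosses of a fair coin, "success" = heads (illustrative assumption)
n <- 10
p <- 0.5

p_exactly_3   <- dbinom(3, size = n, prob = p)                      # P(X = 3)
p_at_most_3   <- pbinom(3, size = n, prob = p)                      # P(X <= 3)
p_more_than_3 <- pbinom(3, size = n, prob = p, lower.tail = FALSE)  # P(X > 3)

print(c(exactly_3 = p_exactly_3, at_most_3 = p_at_most_3, more_than_3 = p_more_than_3))
```

The cumulative value is just the sum of dbinom over 0 to 3, and the lower and upper tails add up to 1.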

Discrete Probability Distributions

R command - pnorm
Command Syntax - pnorm(x, mean = , sd = , lower.tail= )

Two-Tailed Tests - Testing for the possibility of the relationship in both directions. With a significance level of .05, this means that .025 is in each tail of the distribution

One-Tailed Tests - A one-tailed test allots all of your alpha to testing the statistical significance in the one direction of interest. This means that .05 is in one tail of the distribution of your test statistic.

Alternative hypothesis has the > operator, right-tailed test
Right-Tailed Tests: P-value = pnorm(z, lower.tail=FALSE)

Alternative hypothesis has the < operator, left-tailed test
Left-Tailed Tests: P-value = pnorm(z, lower.tail=TRUE)

Alternative hypothesis has the ≠ operator, two-tailed (left and right) test
Two-Tailed Tests: P-value = 2 * pnorm(abs(z), lower.tail=FALSE)

pnorm(x, µ, σ), 
  • x is an observation from a normal distribution 
  • mean µ 
  • standard deviation σ
Computing P value from t value (one-tailed; double it for a two-tailed test)
pt(-abs(t-value), df=degrees of freedom)

Reference

Happy Learning!!!

December 31, 2015

R and Datascience


I found this site very interesting: datascienceplus

Using R, the author has categorized the material into
  • Data Loading
  • Data Management
  • Visualization
  • Stats
This really helps to organize R learning accordingly. I am trying to follow the same pattern for my own R learning

Happy Learning and Happy New Year 2016!!!

December 28, 2015

Information Retrieval Notes


Token - A sequence of characters produced by tokenization, i.e. chopping the text into pieces and throwing away certain characters (such as punctuation)
Type - Equivalence class of tokens
Term - A type that is included in the IR dictionary
Term Frequency (tf) - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df) - Number of documents in the collection in which the term appears
Inverse Document Frequency (IDF) - log(N / df), where N is the total number of documents in the collection and df is the number of documents containing term t
Stemming - Crude heuristics that chop off the ends of words and collapse derivationally related words. Stemming increases recall because morphological variants of a word are collapsed into a single token, giving a higher chance of retrieval
Lemmatization - Returns the base or dictionary form of a word. Collapses the different inflectional forms of a word.
Skip Pointers - For a postings list of length N, use sqrt(N) evenly placed skip pointers
Positional Index - Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the set of documents (file names) that contain it
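To tie the definitions together, here is a minimal base-R sketch over a made-up three-document corpus. The documents, the base-10 logarithm and the simple letters-only tokenizer are all assumptions chosen for illustration.

```r
# Tiny made-up corpus; names are the document ids
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat",
          d3 = "cat and dog")

# Tokenization: lower-case, then split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# Dictionary of terms
terms <- sort(unique(unlist(tokens)))

# Inverted index: term -> ids of the documents that contain it
inverted_index <- lapply(setNames(terms, terms), function(term) {
  names(tokens)[vapply(tokens, function(tok) term %in% tok, logical(1))]
})

# Term frequency matrix (rows = terms, columns = documents)
tf <- vapply(tokens,
             function(tok) tabulate(match(tok, terms), nbins = length(terms)),
             integer(length(terms)))
rownames(tf) <- terms

# Document frequency, inverse document frequency and log-weighted tf-idf
doc_freq <- rowSums(tf > 0)
idf      <- log10(length(docs) / doc_freq)
log_tf   <- ifelse(tf > 0, 1 + log10(tf), 0)
tf_idf   <- log_tf * idf   # idf is recycled down each column, one value per term

print(inverted_index$cat)
print(round(tf_idf, 3))
```

Terms that appear in fewer documents (such as "mat") get a higher IDF than terms spread across the collection (such as "the").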

Boolean Retrieval (AND, OR, NOT)
  • Easy to Implement
  • Computationally efficient
  • Expressiveness and Clarity
Cons of Boolean Retrieval
  • No Ranking
  • No Weighting
Discounted Cumulative Gain (DCG)
  • Highly relevant docs are more useful when they appear earlier in search results list
  • Highly relevant docs are more useful than marginally relevant docs
DCG = sum over ranks i of (2^relevance_i - 1) / log2(i + 1)
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (best possible) ordering
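A short R sketch of these two measures, using the common exponential-gain form (2^rel - 1) / log2(i + 1) summed over rank positions i. The relevance grades in the example ranking are invented for illustration.

```r
# DCG with exponential gain: sum over positions i of (2^rel_i - 1) / log2(i + 1)
dcg <- function(rel) {
  i <- seq_along(rel)
  sum((2^rel - 1) / log2(i + 1))
}

# NDCG: DCG divided by the DCG of the ideal ordering (relevance sorted descending)
ndcg <- function(rel) {
  dcg(rel) / dcg(sort(rel, decreasing = TRUE))
}

# Hypothetical graded relevance (0 = not relevant ... 3 = highly relevant)
# of the documents in the order a search engine returned them
ranking <- c(3, 2, 3, 0, 1, 2)

print(dcg(ranking))
print(ndcg(ranking))
```

A ranking that is already in ideal order scores an NDCG of exactly 1; pushing a highly relevant document further down the list lowers the score.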





HITS - Hyperlink-Induced Topic Search
  • Authorities - Direct answers to the information need, e.g. the homepage microsoft.com
  • Hubs - Pages with good links to pages that answer the information need
  • Wikipedia is a good example of both a hub and an authority
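The hub and authority scores can be computed by repeated updates: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. Below is a minimal sketch on a made-up four-page graph; the page names and links are assumptions.

```r
# Made-up link graph: A[i, j] = 1 means page i links to page j
pages <- c("hub1", "hub2", "auth1", "auth2")
A <- matrix(0, nrow = 4, ncol = 4, dimnames = list(pages, pages))
A["hub1", c("auth1", "auth2")] <- 1
A["hub2", "auth1"] <- 1

hub  <- rep(1, length(pages))
auth <- rep(1, length(pages))

for (iteration in 1:50) {
  auth <- as.vector(t(A) %*% hub)    # authority = sum of hub scores of in-linking pages
  auth <- auth / sqrt(sum(auth^2))   # normalize to unit length
  hub  <- as.vector(A %*% auth)      # hub = sum of authority scores of linked pages
  hub  <- hub / sqrt(sum(hub^2))
}
names(hub)  <- pages
names(auth) <- pages

print(round(rbind(hub = hub, authority = auth), 3))
```

In this toy graph auth1 ends up as the strongest authority (two pages link to it) and hub1 as the strongest hub (it links to both authorities).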





Happy Learning!!!

December 24, 2015

T-Test


- Developed in 1908 by William Sealy Gosset
- Also referred to as Student's t-test
- Mu (μ) and Sigma (σ) indicate population parameters
- x̄ (x-bar) and s represent the mean and standard deviation of the sample




Hypothesis Tests in R



One Sample T-Test

Function - t.test example in R

Happy Learning!!!

December 23, 2015

Hypothesis Testing Basics


After the exams I understood where I need to improve my learning. These are the crucial chapters

- P test using R Programming
- P test using Python Programming
- Hypothesis test using R Programming
- Hypothesis test using Python Programming

I glanced through a couple of sites and am bookmarking some pointers

Normal Distribution Properties




Key Pointers
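- The normal distribution is unimodal and symmetric
- Mean (Mu)
- Standard Deviation (Sigma)
- 99.7% of observations lie within 3 Sigma of the mean
- 95% of observations lie within 2 Sigma of the mean
- |Z| > 2 (Unusual)
- pnorm gives the percentile of an observation
- qnorm gives the quantile (cutoff value) for a percentile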
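These pointers can be checked directly in R with the standard normal distribution. The sketch below is my own illustration of pnorm and qnorm, not taken from the bookmarked sites.

```r
# Share of a normal distribution within 1, 2 and 3 standard deviations of the mean
within_1 <- pnorm(1) - pnorm(-1)
within_2 <- pnorm(2) - pnorm(-2)
within_3 <- pnorm(3) - pnorm(-3)
print(round(c(within_1, within_2, within_3), 4))

# pnorm: percentile of an observation (here a z score of 2)
percentile_of_2 <- pnorm(2)

# qnorm: cutoff value for a given percentile (inverse of pnorm)
cutoff_975 <- qnorm(0.975)

print(c(percentile_of_2 = percentile_of_2, cutoff_975 = cutoff_975))
```

Note that the "95% within 2 Sigma" rule is a rounding: the exact cutoff for the middle 95% is qnorm(0.975), which is slightly below 2.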







Key Pointers 
- Creating Null and Alternate Hypothesis conditions
- Identifying sample space, standard error, population mean, standard deviation from input question
- Computing P value






Happy Learning!!!

November 24, 2015

t-test and z-test





Problems to workout (Good Compiled List)

References
Link1
Link2

Z - Scores

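Z-scores make it easy to compare scores from distributions that use different scales

Formula #1 (z score of a single observation)
z = (x - μ) / σ

Formula #2 (z score of a sample mean)
z = (x̄ - μ) / (σ / sqrt(n))

Formula #3 for raw score computation is defined by
x = μ + z * σ

Formula #4 for Standard Error
SE = σ / sqrt(n)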

Trying out problems in link 

Problem 2. Suppose X is a normal random variable with a mean of 120 and a standard deviation of 20. Determine the probability that X is greater than 135.

Mean = 120
SD = 20
Z score = (135-120)/20 = 0.75

From the z table, P(Z < 0.75) = 0.7734
P(X > 135) = P(Z > 0.75) = 1 - 0.7734 = 0.2266

Problem 4. If the test scores of 400 students are normally distributed with a mean of 100 and a standard deviation of 10, approximately how many students scored between 90 and 110?
Mean = 100
SD = 10

For x = 90, z = (90-100)/10 = -1
For x = 110, z = (110-100)/10 = 1
P(Z < -1) = 0.1587
P(Z < 1) = 0.8413
P(-1 < Z < 1) = 0.8413 - 0.1587 = 0.6826

Multiply this proportion by 400: 0.6826 * 400 = 273.04. After rounding, we get 273 students.

Problem 16. A traffic study shows that the average number of occupants in a car is 1.5 and the standard deviation is .35. In a sample of 45 cars, find the probability that the mean number of occupants is greater than 1.6.

Mean = 1.5
SD = .35

Applying Formula #2

P(mean > 1.6) = 1- P(mean < 1.6)
Z(1.6) = ((1.6-1.5)*sqrt(45)) / 0.35
          = 1.916

P(Z < 1.91) = 0.9719 (from the z table)
P(Z > 1.91) = 1 - 0.9719 = 0.0281
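The three problems above can also be checked with pnorm instead of a z table. This is a sketch with variable names of my own; the results can differ from the table-based answers in the third decimal place because the z scores are not rounded.

```r
# Problem 2: P(X > 135) for X ~ Normal(mean = 120, sd = 20)
p2 <- pnorm(135, mean = 120, sd = 20, lower.tail = FALSE)

# Problem 4: share of scores between 90 and 110 for Normal(100, 10), times 400 students
p4_share    <- pnorm(110, mean = 100, sd = 10) - pnorm(90, mean = 100, sd = 10)
p4_students <- round(400 * p4_share)

# Problem 16: P(sample mean > 1.6) with n = 45, population mean 1.5 and sd 0.35
se16 <- 0.35 / sqrt(45)        # standard error of the mean
z16  <- (1.6 - 1.5) / se16     # z score of the sample mean
p16  <- pnorm(z16, lower.tail = FALSE)

print(c(p2 = p2, p4_students = p4_students, z16 = z16, p16 = p16))
```

For Problem 16 the same answer comes from pnorm(1.6, mean = 1.5, sd = se16, lower.tail = FALSE), since the standard error plays the role of the standard deviation of the sample mean.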

Happy Learning!!!


November 21, 2015

chi-square test for homogeneity

The chi-square test for homogeneity is a test made to determine whether several populations are similar or equal or homogeneous in some characteristics

This link was useful

I tried the problem provided in the link

Problem - Know how to compute the chi-square homogeneity test statistic.

Step 1 


Step 2



Step 3



1-pchisq(19,df=2) - R Command
7.485183e-05

Since it is less than 0.05, you reject the null hypothesis

Happy Learning!!!

Chi Square Test for Independence

  • Uses a cross classification table to examine the nature of the relationship between these variables
  • Tables are sometimes referred to as contingency tables
  • Determines whether the variables are dependent on each other or not
Approach
  • H0: the chi-square test for independence is conducted by assuming that there is no relationship between the two variables
  • Ha: the alternative hypothesis is that there is some relationship between the variables
The general formula for the degrees of freedom is the number of rows minus one, times the number of columns minus 1.

In terms of independence and dependence these hypotheses could be stated
  • H0 : X and Y are independent
  • H1 : X and Y are dependent
Expected Frequency = ((row total)*(column total))/Total Population

I liked the example provided in link  

Problem - Test for a Relationship between Sex and Class

Y (Social Class) | X (Sex): Male (M) | X (Sex): Female (F) | Total
Upper Middle (A) | 33 | 29 | 62
Middle (B) | 153 | 181 | 334
Working (C) | 103 | 81 | 184
Lower (D) | 16 | 14 | 30
Total | 305 | 305 | 610

Table 10.12: Social Class Cross Classified by Sex of Respondents

Expected Frequency = ((row total)*(column total))/Total Population



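Chi-square statistic = sum of (Observed - Expected)^2 / Expected over the 8 cells = 5.3691 for the table above

1-pchisq(5.3691,df=3)
 0.1467 (approximately)
Since the significance is greater than or equal to 0.05, you don't reject the null hypothesis

The conclusion matches the one in the problem although the approach is different. The grand total of the table is 610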
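The whole calculation for the Sex by Social Class table can be reproduced in R. The sketch below first computes the expected frequencies and the statistic by hand, then compares them with the built-in chisq.test; the object names are my own.

```r
# Observed counts from Table 10.12 (rows = social class, columns = sex)
observed <- matrix(c(33, 153, 103, 16,    # Male column
                     29, 181,  81, 14),   # Female column
                   ncol = 2,
                   dimnames = list(Class = c("Upper Middle", "Middle", "Working", "Lower"),
                                   Sex   = c("Male", "Female")))

# Expected frequency = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic and degrees of freedom: (rows - 1) * (columns - 1)
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Built-in test for comparison
result <- chisq.test(observed, correct = FALSE)

print(expected)
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
print(result)
```

pchisq(..., lower.tail = FALSE) is the same as 1 - pchisq(...), written so the upper tail is requested directly.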

Happy Learning!!!

Stats - Chi-Square Goodness of Fit Test

Purpose - Test whether observed frequencies follow a specified (expected) distribution

The chi-square test is defined for the hypothesis:
H0: The data follow a specified distribution
Ha: The data do not follow the specified distribution
This means that if the significance value is less than 0.05, you reject the null hypothesis; if significance is greater than or equal to 0.05, you don't reject the null hypothesis

Formula is chi-square = sum of (Oi - Ei)^2 / Ei, where Oi are the observed and Ei the expected frequencies
I liked the example mentioned in notes

Problem - Testing an octahedral die to see if it is biased

Score 1 2 3 4 5 6 7 8
Frequency 7 10 11 9 12 10 14 7 (Observed)

Degrees of Freedom = Number of entries - 1. Here it is 8-1 = 7
Test the hypothesis H0: The die is fair
H1: The die is not fair
Significance level alpha = 0.05

Expected frequency under a uniform distribution: Ei = Sum of all observed frequencies / 8 (number of faces)
= 80/8 = 10

The expected values will be
Score 1 2 3 4 5 6 7 8
Frequency 10 10 10 10 10 10 10 10 (Expected)

To compute the statistic we need (Oi-Ei) and ((Oi-Ei)*(Oi-Ei))/Ei for each pair of elements in the two arrays, then sum the results
Chi-square = (9 + 0 + 1 + 1 + 4 + 0 + 16 + 9) / 10 = 4

Compute the p value from the chi-square statistic (R Command)
1-pchisq(4,df=7)
0.7797774

This is above the significance level (0.7797774 > 0.05), so we cannot reject the null hypothesis

Answer - There is no evidence that the die is biased
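The die example can be run end to end in R. This sketch computes the statistic by hand and then with chisq.test, which by default tests against equal probabilities for every category; the variable names are my own.

```r
# Observed frequencies for faces 1..8 of the octahedral die
observed <- c(7, 10, 11, 9, 12, 10, 14, 7)

# A fair die means every face is expected equally often
expected <- rep(sum(observed) / length(observed), length(observed))

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Built-in goodness-of-fit test (uniform probabilities by default)
result <- chisq.test(observed)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
print(result)
```

To test against a non-uniform distribution, chisq.test accepts the hypothesized probabilities through its p argument.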

Happy Learning!!!

Good Read on Taylor Series

Two summary points
  • A Taylor Series is an expansion of a function into an infinite sum of terms
  • A derivative gives you the slope of a function at any point
Detailed Notes in link
Taylor series Formula Compilation - link
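As a concrete example (my own, not from the linked notes), the Taylor series of exp(x) around 0 is the sum of x^k / k! for k = 0, 1, 2, ... The sketch below adds up the first few terms and compares the result with R's exp.

```r
# Partial sum of the Taylor series of exp(x) around 0: sum of x^k / k! for k = 0..n_terms-1
taylor_exp <- function(x, n_terms) {
  k <- 0:(n_terms - 1)
  sum(x^k / factorial(k))
}

# The approximation of e = exp(1) improves as more terms are added
for (n_terms in c(3, 5, 10, 15)) {
  approx <- taylor_exp(1, n_terms)
  cat(n_terms, "terms:", approx, " error:", abs(approx - exp(1)), "\n")
}
```

With only three terms the estimate is 1 + 1 + 1/2 = 2.5; the error shrinks quickly because the factorial in the denominator grows so fast.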

Happy Learning!!!

November 08, 2015

K Means Clustering


I'm slowly making progress in Stats, with a lot of learning along the way. This post is from my class notes

K-means clustering

  • Finding groups of objects similar to one another
  • A partitioning approach to clustering
  • The means move on every iteration (they usually converge within the first few iterations)
  • Classifies a given data set into a chosen number of clusters (K)
  • Does not work well when clusters differ in density (sparse vs. dense)

Great 5 Minute Video



Step 1 - "Figure out the centre of each region"
Step 2 - "Select K data points randomly as the initial centres"
Step 3 - "Assign each data point to the nearest centre"
Step 4 - "Recalculate the new centroids"
Step 5 - "Repeat Steps 3 and 4 until the centroids stop moving"
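A quick way to try this out is R's built-in kmeans function. The sketch below uses eight made-up 2-D points forming two obvious groups; the data, the seed and the nstart value are assumptions for illustration. Note that kmeans defaults to the Hartigan-Wong algorithm, a refinement of the basic assign-and-recalculate loop in the steps above.

```r
set.seed(42)  # the starting centres are random, so fix the seed for repeatability

# Made-up data: four points near (1, 1) and four points near (8, 8)
points <- rbind(
  cbind(x = c(1.0, 1.2, 0.8, 1.1), y = c(1.0, 0.9, 1.1, 1.2)),
  cbind(x = c(8.0, 8.2, 7.9, 8.1), y = c(8.0, 8.1, 7.8, 8.3))
)

# K = 2 clusters; nstart = 10 repeats the random start and keeps the best result
fit <- kmeans(points, centers = 2, nstart = 10)

print(fit$cluster)   # cluster label assigned to each point
print(fit$centers)   # final centroids
print(fit$iter)      # number of iterations used
```

The cluster labels themselves (1 or 2) are arbitrary; what matters is which points share a label.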

More Reads - K-Means Clustering

Happy Learning!!!

November 02, 2015

Quick Tip - Python Stemming Module Installation - Windows


Copy the scripts to package folder. Run the command easy_install.py specifying the package containing scripts.

Happy Learning!!!