Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): December 2015

December 31, 2015

R and Datascience

I found the site datascienceplus very interesting.

The author has categorized the R material into:

Data Loading
Data Management
Visualization
Stats

This categorization really helps to structure R learning. I am trying to follow the same pattern for my own R learning.

Happy Learning and Happy New Year 2016!!!

December 28, 2015

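Token - A sequence of characters treated as a unit. Tokenization chops text into tokens, throwing away certain characters such as punctuation
Type - Equivalence class of tokens (all tokens with the same character sequence)
Term - A type that is included in the IR system's dictionary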
Term Frequency - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df_t) - Number of documents in the collection in which term t appears
Inverse Document Frequency (IDF) - log(N / df_t), where N is the total number of documents in the collection and df_t is the number of documents containing term t
Stemming - Crude heuristics that chop off the ends of words, collapsing derivationally related words. Stemming increases recall because morphological variants of a word are collapsed into a single token, giving a higher chance of retrieval
Lemmatization - Returns the base or dictionary form of a word, collapsing its different inflectional forms
Skip Pointers - For a postings list of length N, use sqrt(N) evenly spaced skip pointers
Positional Index - Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the set of documents (for example, file names) that contain it
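
The definitions above can be tried out on a tiny collection. The sketch below is only an illustration: the three documents, the regex tokenizer, and the choice of base-10 logarithms are my assumptions, not part of the original notes. It builds an inverted index and a positional index, then computes log frequency and IDF.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy collection; doc ids and texts are made up for illustration.
docs = {
    "d1": "The cat sat on the mat.",
    "d2": "The dog sat.",
    "d3": "Cats and dogs",
}

def tokenize(text):
    # Lowercase, keep runs of letters, throw away punctuation.
    return re.findall(r"[a-z]+", text.lower())

# Inverted index: term -> set of doc ids.
inverted = defaultdict(set)
# Positional index: term -> {doc id: [positions]}.
positional = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, tok in enumerate(tokenize(text)):
        inverted[tok].add(doc_id)
        positional[tok][doc_id].append(pos)

def log_tf(term, doc_id):
    # 1 + log10(tf) if tf > 0, otherwise 0 (base 10 is an assumption).
    tf = Counter(tokenize(docs[doc_id]))[term]
    return 1 + math.log10(tf) if tf > 0 else 0.0

def idf(term):
    # log10(N / df_t); returns 0 for terms that appear in no document.
    df = len(inverted.get(term, ()))
    return math.log10(len(docs) / df) if df else 0.0

print(sorted(inverted["sat"]))      # documents containing "sat"
print(dict(positional["the"]))      # positions of "the" per document
print(log_tf("the", "d1"), idf("cat"))
```

Note that without stemming or lemmatization, "cat" and "cats" stay separate terms, which is exactly the recall problem those two techniques address.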

Boolean Retrieval (AND, OR, NOT)

Easy to Implement
Computationally efficient
Expressiveness and Clarity

Cons of Boolean Retrieval

No Ranking
No Weighting
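
Boolean retrieval maps directly onto set operations over an inverted index. The following is a minimal sketch, assuming a made-up four-document collection and whitespace tokenization; it also shows the cons listed above, since the results are plain unranked sets.

```python
def build_index(docs):
    # Inverted index: term -> set of doc ids.
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical documents, made up for illustration.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
index = build_index(docs)
all_docs = set(docs)

def postings(term):
    return index.get(term, set())

and_result = postings("home") & postings("july")  # AND -> intersection
or_result = postings("rise") | postings("top")    # OR  -> union
not_result = all_docs - postings("new")           # NOT -> complement

print(sorted(and_result), sorted(or_result), sorted(not_result))
```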

Discounted Cumulative Gain (DCG)

Highly relevant docs are more useful when they appear earlier in search results list
Highly relevant docs are more useful than marginally relevant docs

DCG = sum over ranks i of (2^relevance_i - 1) / log2(i + 1)
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (best possible) ordering of the results
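
A short sketch of the two formulas, assuming graded relevance labels given in ranked order (rank 1 first). The function names and the sample labels are illustrative assumptions.

```python
import math

def dcg(relevances):
    # Sum of (2^rel - 1) / log2(i + 1) over ranks i = 1, 2, ...
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # IDCG is the DCG of the same labels sorted from most to least relevant.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

good_order = [3, 2, 1]  # highly relevant document first
bad_order = [1, 2, 3]   # highly relevant document last
print(ndcg(good_order), ndcg(bad_order))
```

The same documents score lower when the highly relevant one appears later in the list, which is the first property noted above.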

HITS - Hyperlink-Induced Topic Search

Authorities - Pages that directly answer the information need, e.g. the homepage microsoft.com
Hubs - Pages with good links to pages that answer the information need
Wikipedia is a good example of both a hub and an authority
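
The hub and authority scores reinforce each other: a good hub links to good authorities, and a good authority is linked to by good hubs. Below is a minimal sketch of that iteration, assuming a tiny made-up link graph, a fixed number of iterations, and Euclidean normalization; the page names are hypothetical.

```python
import math

def hits(graph, iterations=20):
    # graph: page -> list of pages it links to.
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical link graph, made up for illustration.
graph = {"portal": ["vendor", "docs"], "blog": ["vendor"], "vendor": []}
hub, auth = hits(graph)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```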

Happy Learning!!!

December 24, 2015

T-Test

- Developed in 1908 by William Gosset
- The t-test is also referred to as Student's t-test
- Mu and Sigma denote the population parameters (mean and standard deviation)
- X-bar and S denote the mean and standard deviation of the sample

Hypothesis Tests in R

One Sample T-Test

Function - t.test example in R
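
For anyone following along in Python rather than R, the one-sample t statistic can be sketched with the standard library alone. The sample values and the hypothesized mean below are made up, and the function name is my own. Unlike R's t.test, this sketch stops at the statistic and degrees of freedom, because the standard library has no t distribution for the p-value.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    # t = (x_bar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom.
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements, made up for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
t_stat, df = one_sample_t(sample, mu0=5.0)
print(t_stat, df)
```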

Happy Learning!!!

December 23, 2015

Hypothesis Testing Basics

After the exams I understood where my learning needs to improve. These are the crucial topics:

- P test using R Programming
- P test using Python Programming
- Hypothesis test using R Programming
- Hypothesis test using Python Programming

I glanced through a couple of sites and am bookmarking some pointers.

Normal Distribution Properties

Key Pointers
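- The normal distribution is unimodal and symmetric
- Mean (Mu)
- Standard Deviation (Sigma)
- About 99.7% of values lie within 3 Sigma of the mean
- About 95% of values lie within 2 Sigma of the mean
- |Z| > 2 is considered unusual
- pnorm gives the percentile (cumulative probability) of an observation
- qnorm gives the quantile or cutoff value for a given percentile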
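
These pointers can be checked numerically. The sketch below uses Python's statistics.NormalDist; the wrappers named pnorm and qnorm are my own thin helpers that mimic the R functions of the same name, not something Python provides under those names.

```python
from statistics import NormalDist

def pnorm(x, mu=0.0, sigma=1.0):
    # Cumulative probability P(X <= x), i.e. the percentile of x.
    return NormalDist(mu, sigma).cdf(x)

def qnorm(p, mu=0.0, sigma=1.0):
    # Cutoff value below which a fraction p of the distribution lies.
    return NormalDist(mu, sigma).inv_cdf(p)

within_2 = pnorm(2) - pnorm(-2)  # share of values within 2 sigma
within_3 = pnorm(3) - pnorm(-3)  # share of values within 3 sigma
cutoff_975 = qnorm(0.975)        # z cutoff for the upper 2.5% tail

print(round(within_2, 4), round(within_3, 4), round(cutoff_975, 2))
```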

Key Pointers
- Creating Null and Alternate Hypothesis conditions
- Identifying the sample size, standard error, population mean and standard deviation from the question
- Computing P value
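
The three steps above can be sketched for the simplest case, a z-test with a known population standard deviation. Everything in the example is an assumption for illustration: the numbers, the function name z_test, and the choice of a two-sided alternative by default.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    # H0: population mean == mu0. The alternative picks which tail(s) count.
    se = sigma / math.sqrt(n)       # standard error of the mean
    z = (sample_mean - mu0) / se    # test statistic
    cdf = NormalDist().cdf(z)
    if alternative == "greater":
        p = 1 - cdf
    elif alternative == "less":
        p = cdf
    else:
        p = 2 * min(cdf, 1 - cdf)
    return z, p

# Hypothetical question: n = 100, sample mean 52, H0 mean 50, sigma 10.
z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(z, p)
```

With these made-up numbers the standard error is 1, so the sample mean sits 2 standard errors above the hypothesized mean, and the two-sided p-value falls just under 0.05.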