"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 31, 2015

R and Datascience


I found this site very interesting: datascienceplus

Using R, the author has categorized the material into:
  • Data Loading
  • Data Management
  • Visualization
  • Stats
This really helps to structure R learning along the same lines. I am trying to follow the same pattern for my own R learning.

Happy Learning and Happy New Year 2016!!!

December 28, 2015

Information Retrieval Notes

Token - A sequence of characters grouped together as a useful unit. Tokenization chops text into tokens and may throw away certain characters, such as punctuation
Type - Equivalence class of tokens (all tokens with the same character sequence)
Term - A type that is included in the IR dictionary
Term Frequency (tf) - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df) - Number of documents in the collection in which term t appears
Inverse Document Frequency (IDF) - log(N / df), where N is the number of documents in the collection and df is the number of documents containing term t
Stemming - Crude heuristics that chop off the ends of words, collapsing derivationally related words. Stemming increases recall because morphological variants of a word are collapsed into a single token, which raises the chance of retrieval
Lemmatization - Returns the base or dictionary form of a word (the lemma), collapsing the different inflectional forms of a word
Skip Pointers - For a postings list of length N, use sqrt(N) evenly spaced skip pointers
Positional Index - Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the list of documents (for example, file names) that contain it
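
To make the definitions above concrete, here is a minimal Python sketch that builds a positional index over a toy collection and derives tf, log frequency, df and IDF from it. The documents, the whitespace tokenizer and the base-10 logarithm are my own assumptions for illustration; the notes do not fix any of them.

```python
import math
from collections import defaultdict

# Toy collection; document ids and texts are made up for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
    "d3": "cats and dogs",
}

def tokenize(text):
    # Simplest possible tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()

# Positional index: term -> {doc_id: [positions of the term in that doc]}
positional_index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(tokenize(text)):
        positional_index[term].setdefault(doc_id, []).append(pos)

def tf(term, doc_id):
    # Term frequency: number of times the term appears in the document.
    return len(positional_index.get(term, {}).get(doc_id, []))

def log_tf(term, doc_id):
    # Log frequency: 1 + log(tf) if tf > 0, otherwise 0.
    count = tf(term, doc_id)
    return 1 + math.log10(count) if count > 0 else 0.0

def df(term):
    # Document frequency: number of documents containing the term.
    return len(positional_index.get(term, {}))

def idf(term):
    # Inverse document frequency: log(N / df); 0 for unseen terms.
    n_docs = len(docs)
    return math.log10(n_docs / df(term)) if df(term) > 0 else 0.0

def tf_idf(term, doc_id):
    # One common weighting: log frequency times IDF.
    return log_tf(term, doc_id) * idf(term)
```

A rare term such as "cat" (one document) gets a higher IDF than a common term such as "the" (two documents), which is the whole point of the weighting. Note that without stemming, "cat" and "cats" stay separate terms here.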

Pros of Boolean Retrieval (AND, OR, NOT)
  • Easy to Implement
  • Computationally efficient
  • Expressiveness and Clarity
Cons of Boolean Retrieval
  • No Ranking
  • No Weighting
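
A rough sketch of Boolean retrieval in Python, treating each postings list as a set of document ids. The terms, the document ids and the helper names are hypothetical.

```python
# Hypothetical postings: term -> set of document ids containing the term.
postings = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 3, 5},
    "calpurnia": {2},
}
all_docs = {1, 2, 3, 4, 5}

def boolean_and(a, b):
    # Documents containing both terms.
    return a & b

def boolean_or(a, b):
    # Documents containing at least one of the terms.
    return a | b

def boolean_not(a):
    # Documents in the collection that do not contain the term.
    return all_docs - a

# Query: brutus AND caesar AND NOT calpurnia
result = boolean_and(
    boolean_and(postings["brutus"], postings["caesar"]),
    boolean_not(postings["calpurnia"]),
)
print(sorted(result))  # [1]
```

The result is just a set of matching documents: nothing says which match is better, which is the "no ranking, no weighting" drawback listed above.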
Discounted Cumulative Gain (DCG)
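  • Highly relevant docs are more useful when they appear earlier in the search results list
  • Highly relevant docs are more useful than marginally relevant docs
DCG = sum over rank positions i of (2^rel_i - 1) / log2(i + 1), where rel_i is the relevance grade of the result at position i
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (best possible) ordering of the same results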
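
A small Python sketch of DCG and NDCG, using the common exponential-gain form (2^rel - 1) / log2(i + 1) with positions counted from 1. The relevance grades in the example are made up.

```python
import math

def dcg(relevances):
    # relevances[0] is the grade of the top-ranked result (position i = 1).
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    # IDCG is the DCG of the same grades sorted from best to worst.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades (0 = not relevant, 3 = highly relevant).
good_ranking = [3, 2, 0]
bad_ranking = [0, 2, 3]
print(ndcg(good_ranking), ndcg(bad_ranking))
```

The same three documents score an NDCG of 1.0 when the highly relevant one comes first and less than 1.0 when it comes last, matching the two bullet points above.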





HITS - Hyperlink-Induced Topic Search
  • Authorities - Pages that directly answer the information need, e.g. the homepage microsoft.com
  • Hubs - Pages with good links to pages that answer the information need
  • Wikipedia is a good example of both a hub and an authority
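
A minimal sketch of the HITS iteration in Python. A page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The link graph, the page names and the fixed iteration count are assumptions for illustration.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1"],
    "auth1": [],
    "auth2": [],
}

def hits(links, iterations=20):
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: sum the authority scores of all pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits(links)
```

In this toy graph "auth1" ends up as the strongest authority because both hubs point to it, and "hub1" is the stronger hub because it links to both authorities.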









Happy Learning!!!

December 24, 2015

T-Test

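- Developed in 1908 by William Sealy Gosset, who published under the pseudonym "Student"
- The t-test is therefore also referred to as Student's t-test
- Mu and sigma denote the population parameters (mean and standard deviation)
- X-bar and s denote the mean and standard deviation of the sample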




Hypothesis Tests in R



One Sample T-Test

Function - t.test example in R
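
For comparison with R's t.test, here is a rough Python sketch of the one-sample t statistic using only the standard library. The sample values and the hypothesized mean are made up. It computes the statistic and degrees of freedom only; getting the p-value needs the t distribution, which the standard library does not provide (scipy.stats.ttest_1samp covers that if SciPy is available).

```python
import math
import statistics

def one_sample_t(sample, mu0):
    n = len(sample)
    x_bar = statistics.mean(sample)   # sample mean (x-bar)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)             # standard error of the mean
    t = (x_bar - mu0) / se            # t statistic
    return t, n - 1                   # statistic and degrees of freedom

# Hypothetical measurements; H0: the population mean is 5.0.
sample = [5.1, 4.9, 5.3, 5.2, 5.0]
t_stat, dof = one_sample_t(sample, mu0=5.0)
print(t_stat, dof)
```

The t statistic measures how many standard errors the sample mean lies from the hypothesized mean; it is then compared against the t distribution with n - 1 degrees of freedom.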

Happy Learning!!!

December 23, 2015

Hypothesis Testing Basics


After the exams I understood where I need to improve. These are the crucial chapters:

- P test using R Programming
- P test using Python Programming
- Hypothesis test using R Programming
- Hypothesis test using Python Programming

I glanced through a couple of sites and am bookmarking some pointers.

Normal Distribution Properties




Key Pointers
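- The normal distribution is unimodal and symmetric
- Mean (Mu)
- Standard Deviation (Sigma)
- About 99.7% of values fall within 3 sigma of the mean
- About 95% of values fall within 2 sigma of the mean
- |Z| > 2 is considered unusual
- pnorm gives the percentile (cumulative probability) of an observation
- qnorm gives the quantile, or cutoff value, for a given percentile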
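
The same pointers can be checked in Python with statistics.NormalDist (Python 3.8+), where cdf plays the role of R's pnorm and inv_cdf the role of qnorm. The score distribution with mean 100 and standard deviation 15 is a made-up example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within_k_sigma(k):
    # Probability that a value lies within k standard deviations of the mean.
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_2 = within_k_sigma(2)   # about 0.954
within_3 = within_k_sigma(3)   # about 0.997

# Hypothetical scores: mean 100, standard deviation 15.
scores = NormalDist(mu=100, sigma=15)

z_130 = (130 - scores.mean) / scores.stdev   # z-score of an observation of 130
percentile_130 = scores.cdf(130)             # like pnorm(130, 100, 15)
cutoff_95 = scores.inv_cdf(0.95)             # like qnorm(0.95, 100, 15)

print(within_2, within_3, z_130, percentile_130, cutoff_95)
```

An observation of 130 has a z-score of 2, so it sits right at the edge of what the notes call unusual.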







Key Pointers 
- Creating the null and alternative hypothesis conditions
- Identifying the sample size, standard error, population mean and standard deviation from the input question
- Computing the P value
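
Put together, the steps above look roughly like this in Python for a z-test on a sample mean, assuming the population standard deviation is known. The function name, its parameters and the numbers in the example are my own assumptions.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n, alternative="two-sided"):
    # H0: the true mean equals pop_mean. The alternative is chosen by the caller.
    se = pop_sd / math.sqrt(n)           # standard error of the sample mean
    z = (sample_mean - pop_mean) / se    # test statistic
    cdf = NormalDist().cdf(z)
    if alternative == "greater":         # HA: true mean > pop_mean
        p = 1 - cdf
    elif alternative == "less":          # HA: true mean < pop_mean
        p = cdf
    else:                                # HA: true mean != pop_mean
        p = 2 * min(cdf, 1 - cdf)
    return z, p

# Hypothetical question: n = 100, sample mean 52, population mean 50, sd 10.
z, p = z_test(sample_mean=52, pop_mean=50, pop_sd=10, n=100)
print(z, p)
```

Here z comes out as 2 and the two-sided P value is a little under 0.05, so at the 5% level the null hypothesis would be rejected.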






Happy Learning!!!