Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): Information Retrieval Notes

December 28, 2015

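Token - A sequence of characters in a document that is treated as one unit. Tokenization chops the text into tokens and throws away certain characters, such as punctuation.
Type - The equivalence class of all tokens with the same character sequence
Term - A type (possibly normalized) that is included in the IR system's dictionary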
Term Frequency - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df_t) - Number of documents in the collection in which term t appears
Inverse Document Frequency - IDF = log(N / df_t), where N is the total number of documents in the collection and df_t is the number of documents that contain term t. Rare terms get a high IDF; a term that appears in every document gets an IDF of 0.
Stemming - A crude heuristic that chops off the ends of words and collapses derivationally related words. Stemming increases recall because morphological variants of a word are collapsed into a single term, which raises the chance of retrieval.
Lemmatization - Returns the base or dictionary form of a word (the lemma). Collapses the different inflectional forms of a word.
Skip Pointers - For a postings list of length N, use Sqrt(N) evenly spaced skip pointers
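Positional Index - For each term, stores the documents that contain it and the positions of the term within each document: Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the list of documents (for example, file names) that contain it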
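
The definitions above can be tied together in a small Python sketch. It is only an illustration: the function names, the regex tokenizer, the toy documents and the choice of base-10 logarithms are all assumptions, since the notes do not fix any of them.

```python
import math
import re
from collections import defaultdict

def tokenize(text):
    # Chop text into tokens, throw away punctuation, and lowercase
    # so that "The" and "the" fall into the same type.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    # Positional inverted index: term -> {doc_id: [pos1, pos2, ...]}
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def term_frequency(index, term, doc_id):
    # Number of times the term appears in the document.
    return len(index.get(term, {}).get(doc_id, []))

def log_frequency(tf):
    # 1 + log(tf) if tf > 0, otherwise 0 (base 10 is an assumption).
    return 1 + math.log10(tf) if tf > 0 else 0.0

def document_frequency(index, term):
    # Number of documents in the collection that contain the term.
    return len(index.get(term, {}))

def idf(index, term, n_docs):
    # log(N / df_t); returns 0 for terms that are not in the index.
    df = document_frequency(index, term)
    return math.log10(n_docs / df) if df > 0 else 0.0

# Toy collection (hypothetical documents).
docs = {
    "d1": "The cat sat on the mat.",
    "d2": "The dog sat.",
    "d3": "Cats and dogs",
}
index = build_positional_index(docs)
print(index["the"])                      # positions of "the" per document
print(term_frequency(index, "the", "d1"))
print(idf(index, "cat", len(docs)))
```

Note that this tokenizer does no stemming or lemmatization, so "cat" and "cats" stay separate terms here.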

Boolean Retrieval (AND, OR, NOT)

Easy to Implement
Computationally efficient
Expressiveness and Clarity

Cons of Boolean Retrieval

No ranking of results
No weighting of query terms
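
As a rough sketch of how Boolean operators run over postings lists, the Python below implements AND as a merge of two sorted lists of document IDs, with an optional variant that uses skip pointers spaced Sqrt(N) apart. The function names and the sample lists are made up for illustration, and the skip pointers are simulated by index arithmetic rather than stored in the list.

```python
import math

def intersect(p1, p2):
    # AND: linear merge of two sorted postings lists.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_with_skips(p1, p2):
    # Same merge, but positions that are multiples of the skip length
    # carry a skip pointer that jumps sqrt(N) entries ahead.
    s1 = max(1, math.isqrt(len(p1)))
    s2 = max(1, math.isqrt(len(p2)))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

def union(p1, p2):
    # OR: every document that appears in either list.
    return sorted(set(p1) | set(p2))

def negate(p, all_doc_ids):
    # NOT: every document in the collection that is not in the list.
    excluded = set(p)
    return [d for d in all_doc_ids if d not in excluded]

a = [2, 4, 8, 16, 19, 23, 28, 43]
b = [1, 2, 3, 5, 8, 41, 43, 51, 60, 71]
print(intersect(a, b))
print(intersect_with_skips(a, b))
```

Each operator returns a plain list of matching document IDs with no score attached, which is where the lack of ranking and weighting comes from.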

Discounted Cumulative Gain (DCG)

Highly relevant docs are more useful when they appear earlier in search results list
Highly relevant docs are more useful than marginally relevant docs

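DCG at rank p = sum over positions i = 1..p of (2^rel_i - 1) / log2(i + 1), where rel_i is the graded relevance of the result at position i
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal ordering (results sorted by decreasing relevance)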
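
A minimal Python sketch of the two measures follows. It takes a list of graded relevance scores in ranked order. The function names are illustrative, and IDCG is computed by re-sorting the same list, which assumes that every relevant document already appears somewhere in the ranking.

```python
import math

def dcg(relevances):
    # Sum of (2^rel_i - 1) / log2(i + 1) over positions i = 1..p.
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    # Normalize by the DCG of the ideal ordering (highest relevance first).
    # Assumption: the ideal ordering is a permutation of the same list.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 3, 0, 1, 2]   # hypothetical relevance grades
print(dcg(ranking))
print(ndcg(ranking))
```

Swapping a highly relevant document into a later position lowers the score, which matches the two assumptions listed above.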