"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 28, 2015

Information Retrieval Notes

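Token - A sequence of characters grouped together as a unit. Tokenization chops text into tokens, often throwing away certain characters such as punctuation
Type - Equivalence class of tokens (all tokens with the same character sequence)
Term - A type (usually normalized) that is included in the IR system's dictionary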
Term Frequency - Number of times term t appears in document d
Log Frequency - 1 + log(tf) if tf > 0, otherwise 0
Document Frequency (df_t) - Number of documents in the collection in which term t appears
Inverse Document Frequency (IDF) - log(N / df_t), where N is the total number of documents in the collection and df_t is the number of documents containing term t
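A minimal sketch of the frequency definitions above, in Python. The toy collection, the function names and the base-10 logarithm are assumptions made for illustration (the notes do not fix a log base), and the final tf-idf product is the usual way of combining the two weights rather than something stated in these notes.

```python
import math
from collections import Counter

# Hypothetical toy collection: doc id -> already-tokenized text.
docs = {
    "d1": ["new", "home", "sales", "top", "forecasts"],
    "d2": ["home", "sales", "rise", "in", "july"],
    "d3": ["increase", "in", "home", "sales", "in", "july"],
}

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def document_frequency(term, docs):
    """Number of documents in the collection that contain the term."""
    return sum(1 for tokens in docs.values() if term in tokens)

def idf(term, docs):
    """Inverse document frequency: log10(N / df_t); 0 for unseen terms."""
    df = document_frequency(term, docs)
    return math.log10(len(docs) / df) if df > 0 else 0.0

def tf_idf(term, doc_id, docs):
    """Assumed combination: log-weighted tf multiplied by idf."""
    tf = Counter(docs[doc_id])[term]
    return log_tf(tf) * idf(term, docs)
```

A term that appears in every document (here "home") gets an IDF of 0, so it contributes nothing to the score, while a rarer term such as "new" is weighted more heavily.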
Stemming - Crude heuristic that chops off the ends of words, collapsing derivationally related words. Stemming increases recall because morphological variants of a word are collapsed into a single token, which raises the chance of retrieval
Lemmatization - Returns a word to its base or dictionary form (the lemma), collapsing the different inflectional forms of the word
Skip Pointers - For a postings list of length N, use about sqrt(N) evenly spaced skip pointers
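To show how skip pointers can speed up merging two postings lists, here is a rough Python sketch. It assumes sorted lists of unique doc IDs and simulates a skip pointer at every position that is a multiple of floor(sqrt(N)); the function name and this layout are illustrative choices, not a fixed recipe.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using simulated skip pointers."""
    # Assumed layout: skip pointers sit at indexes 0, s, 2s, ... with s ~ sqrt(N).
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

The skip is safe because everything jumped over is smaller than the skip target, which is itself no larger than the doc ID being searched for, so no match can be missed.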
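Positional Index - For each term, stores the documents it appears in and its positions within each document. Term: DocId <Pos1, Pos2>
Inverted Index - A dictionary mapping each term to the set of documents (e.g. file names) that contain it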
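The two index structures above can be sketched in a few lines of Python. The sample documents, the whitespace tokenizer and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical collection: doc id -> raw text.
docs = {
    1: "to be or not to be",
    2: "to do is to be",
    3: "let it be",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_positional_index(docs):
    """Map each term to {doc id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)

inverted = build_inverted_index(docs)
positional = build_positional_index(docs)
```

The positional index costs more space, but it is what makes phrase queries such as "to be" possible: the positions of "be" must follow the positions of "to" within the same document.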

Boolean Retrieval (AND, OR, NOT)
  • Easy to Implement
  • Computationally efficient
  • Expressiveness and Clarity
Cons of Boolean Retrieval
  • No ranking of results (a document either matches or it does not)
  • No term weighting
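As a small illustration of Boolean retrieval, the AND, OR and NOT operators can be written as set operations over postings lists. The toy index, the doc IDs and the helper names below are made up for the example.

```python
# Hypothetical inverted index: term -> sorted doc ids.
index = {
    "brutus": [1, 2, 4],
    "caesar": [1, 2, 4, 5, 6],
    "calpurnia": [2],
}
all_docs = {1, 2, 3, 4, 5, 6}

def boolean_and(p1, p2):
    """Docs that appear in both postings lists."""
    return sorted(set(p1) & set(p2))

def boolean_or(p1, p2):
    """Docs that appear in either postings list."""
    return sorted(set(p1) | set(p2))

def boolean_not(p, all_docs):
    """Docs in the collection that are not in the postings list."""
    return sorted(all_docs - set(p))

# Query: brutus AND caesar AND NOT calpurnia
result = boolean_and(
    boolean_and(index["brutus"], index["caesar"]),
    boolean_not(index["calpurnia"], all_docs),
)
```

The result is an unordered match set, which shows the two drawbacks listed above: every matching document is treated as equally good.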
Discounted Cumulative Gain (DCG)
  • Highly relevant docs are more useful when they appear earlier in search results list
  • Highly relevant docs are more useful than marginally relevant docs
DCG = sum over ranks i of (2^rel_i - 1) / log2(i + 1), where rel_i is the graded relevance of the result at rank i
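A short Python sketch of DCG using the exponential gain (2^rel - 1) with a log2 rank discount. The relevance grades in the usage note are invented for illustration.

```python
import math

def dcg(relevances):
    """DCG of a ranked list of graded relevance scores (rank 1 first)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )
```

For example, dcg([3, 1, 0]) is larger than dcg([0, 1, 3]): the same documents score higher when the highly relevant one is ranked first, which is the first property listed above.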

HITS - Hyperlink-Induced Topic Search
  • Authorities - Pages that directly answer the information need, e.g. the homepage microsoft.com
  • Hubs - Pages with good links to pages that answer the information need
  • Wikipedia is a good example of both a hub and an authority
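The hub and authority scores can be computed iteratively: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below assumes a small hand-made link graph, a fixed number of iterations and Euclidean normalization; all three are illustrative choices.

```python
import math

def hits(graph, iterations=20):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of pages that link in.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: add up authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize both score vectors so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: two link pages pointing at three content pages.
graph = {"portal": ["a", "b", "c"], "blog": ["a", "b"], "a": [], "b": [], "c": []}
hub, auth = hits(graph)
```

In this toy graph "portal" ends up as the strongest hub, and "a" and "b" are stronger authorities than "c" because both link pages point to them.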

Happy Learning!!!