"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 18, 2015

Class Notes - Information Retrieval

Information Retrieval Notes
  • Document Corpus (Collection of links / indexed web structures)
Examples of Information Retrieval
  • Query by humming: identifying a song from a tune the user hums is an IR problem
  • Multimedia IR (music, video, analysing music videos)
  • Photo Search (Visual IR)
Information Retrieval Application Areas
  • Text Information Retrieval
  • Web Search
  • Social Media Search
  • Micro Blogs
  • Twitter Blogs
Boolean Information Retrieval
  • Simplest model
  • Restricted Queries
  • queries are boolean expressions
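
A minimal sketch of Boolean retrieval, assuming a tiny made-up corpus: each term maps to the set of documents containing it, and AND / OR / NOT become set intersection, union and difference. The names `docs` and `docs_with` are illustrative only and were not part of the class.

```python
# Toy corpus: document IDs mapped to text (invented for illustration).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

def docs_with(term):
    """Return the set of document IDs whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

all_ids = set(docs)

# home AND july -> set intersection
and_result = docs_with("home") & docs_with("july")

# new OR rise -> set union
or_result = docs_with("new") | docs_with("rise")

# sales AND NOT july -> intersect with the complement of "july"
not_result = docs_with("sales") & (all_ids - docs_with("july"))

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Scanning every document for every term is only workable for a toy corpus; a real system answers the same queries from a precomputed index.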
Inverted Index
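  • For each term we keep a list of the documents that contain it (a postings list)
  • Like the index of a book (Topic, Page number); close to a glossary
  • Take each document and tokenize its text
  • Order the tokens within each document; use heuristics to combine multiple tokens into a single index term during index construction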
  • Document Frequency - The number of documents in which a term appears
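
The construction steps above can be sketched roughly as follows. This assumes a deliberately naive tokenizer (lowercase, split on whitespace) and takes document frequency in its standard IR sense, the number of documents containing a term. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to a sorted list of the document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    # Sorted postings lists make later list intersection efficient.
    return {term: sorted(ids) for term, ids in postings.items()}

# Sample documents (invented for illustration).
docs = {
    1: "Home sales rise",
    2: "Home prices rise in July",
    3: "July sales",
}

index = build_inverted_index(docs)

# Document frequency: how many documents each term occurs in.
document_frequency = {term: len(ids) for term, ids in index.items()}

print(index["home"], document_frequency["home"])
```

Using a set per term while building means a term repeated inside one document is still counted once for that document.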
Challenges
  • Text ordering (e.g. right-to-left scripts)
  • Proximity Search (Leveraging the context)
  • Encoding 
  • Normalization and keyword detection based on locale
  • Accents, patterns
  • Stemming (Chopping end of words to obtain root word)
  • Porter algorithm for stemming
  • Stopwords, normalization, tokenization, lower casing, stemming, non-latin alphabets, compounds, numbers
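  • Skip pointers (when intersecting two sorted postings lists to find their common elements, advance through each list until the entries match; skip pointers let us jump over runs of entries that cannot match)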
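
A rough sketch of intersecting two sorted postings lists with skip pointers. It assumes skips are placed at evenly spaced positions about sqrt(length) apart, which is a common heuristic and not something stated in the class. The function name and sample lists are illustrative.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted postings lists, using evenly spaced skips."""
    # Assumed heuristic: one skip pointer every ~sqrt(n) entries.
    skip_a = max(1, int(math.sqrt(len(a))))
    skip_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Common element: record it and advance both lists.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only from a skip position and only if the
            # target does not overshoot the other list's current entry.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

# Sample postings lists (invented for illustration).
list_a = [2, 4, 8, 16, 19, 23, 28, 43]
list_b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
common = intersect_with_skips(list_a, list_b)
print(common)
```

A skip is safe because the lists are sorted: if the entry at the skip target is still no larger than the other list's current entry, everything jumped over is smaller and cannot match.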
Happy Learning!!!
