Information Retrieval Notes
- Document corpus (the collection being searched, e.g. a set of links or indexed web structures)
- Identifying a song from a user's humming is also classified as an IR problem
- Multimedia IR (music, video, analysing music videos)
- Photo Search (Visual IR)
- Text Information Retrieval
- Web Search
- Social Media Search
- Micro Blogs
- Twitter Blogs
- Simplest model: Boolean retrieval
- Restricted queries
- Queries are Boolean expressions (terms combined with AND, OR, NOT)
- For each term we keep a list of the documents that contain it
- Like the index of a book (topic, page number); close to a glossary
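- Take each document and tokenize its text
- Keep tokens in their order within the document; use heuristics to combine multiple tokens into a single term for index construction
- Document frequency: the number of documents in which a term appears (not the number of times it occurs)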
- Ordering (Right to Left)
- Proximity search (leveraging the context)
- Encoding
- Normalization and keyword detection based on locale
- Accents, patterns
- Stemming (chopping off the ends of words to obtain the root form)
- Porter algorithm for stemming
- Stopwords, normalization, tokenization, lower casing, stemming, non-Latin alphabets, compounds, numbers
- Skip pointers (to find the elements common to two sorted lists, advance through the lists until the entries match; a skip pointer lets us jump over entries that cannot match)
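
The notes above on tokenization, the per-term document lists and skip pointers can be tied together in a small sketch. This is only an illustration under stated assumptions, not a reference implementation: the toy corpus `DOCS`, the tiny `STOPWORDS` set and the function names are all made up here, stemming is left out, and the skips are simulated with an evenly spaced stride of about the square root of the list length rather than stored pointers.

```python
import math
import re
from collections import defaultdict

# Hypothetical toy corpus; IDs and texts are invented for illustration.
DOCS = {
    1: "Porter stemming chops word endings",
    2: "Skip pointers speed up postings list intersection",
    3: "Stemming and skip pointers in a Boolean index",
    4: "Web search and social media search",
}

# Assumed, deliberately tiny stop list.
STOPWORDS = {"a", "and", "in", "of", "the", "up"}


def tokenize(text):
    """Lowercase the text, split into alphanumeric tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Map each term to a sorted list of the document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect_with_skips(p1, p2):
    """Boolean AND of two sorted lists of unique IDs, with simulated skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Skip only from a skip position, and only if the target
            # does not overshoot the entry we are trying to reach.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return result


index = build_index(DOCS)
print(index["stemming"])                                        # [1, 3]
print(intersect_with_skips(index["stemming"], index["skip"]))   # [3]
```

In this sketch the document frequency of a term is simply `len(index[term])`, and the query `stemming AND skip` is answered by intersecting the two lists. A skip is taken only when its target is still less than or equal to the entry being sought in the other list, so no common element can be jumped over.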