"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

September 01, 2022

NLTK Basics

By default, NLTK (Natural Language Toolkit) ships with a built-in list of English stop words such as “a”, “an”, “the”, “of”, “in”, etc. (List Link)
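A minimal sketch of how to inspect that list, assuming the stopwords corpus has been downloaded:

```python
# Minimal sketch: inspecting NLTK's built-in English stop word list.
import nltk
nltk.download("stopwords")  # one-time download of the stopwords corpus

from nltk.corpus import stopwords

english_stops = stopwords.words("english")
print(len(english_stops))   # size of the built-in list
print(english_stops[:10])   # e.g. 'i', 'me', 'my', 'myself', ...
```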

  • Term Frequency (TF): the number of times a word appears in a document divided by the total number of words in that document.
  • Document Frequency (DF): the number of documents in which a word appears (see the sketch below).
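A small sketch of both counts in plain Python; the toy corpus and whitespace tokenization are illustrative assumptions:

```python
# Minimal sketch: term frequency and document frequency over a toy corpus.
docs = [
    "the food was good",
    "the food was bad",
    "good food is rare",
]
tokenized = [d.split() for d in docs]

def term_frequency(word, tokens):
    # times the word appears in the document / total words in the document
    return tokens.count(word) / len(tokens)

def document_frequency(word, corpus):
    # number of documents that contain the word
    return sum(1 for tokens in corpus if word in tokens)

print(term_frequency("food", tokenized[0]))   # 1/4 = 0.25
print(document_frequency("food", tokenized))  # 3
```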
BOW (Bag of Words)
  • BOW represents a sentence as a vector of word counts.
  • BOW does not preserve word order.
  • BOW fails when word order changes the meaning; the two sentences below get identical vectors (see the sketch after this list):
    • Food was good, not bad at all
    • Food was bad, not good at all
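A minimal sketch of the failure case, using scikit-learn's CountVectorizer as one common way to build BOW vectors:

```python
# Minimal sketch: both sentences map to the same bag-of-words vector,
# because BOW only counts words and ignores their order.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Food was good, not bad at all",
    "Food was bad, not good at all",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())
print(bow[0])
print(bow[1])
print((bow[0] == bow[1]).all())  # True - identical vectors, opposite meanings
```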
The library's name is spelled “spaCy” (lowercase “s”, capital “C”).

One-hot encodings vs Word embeddings
  • One-hot encodings - to represent each word, you create a zero vector with length equal to the vocabulary size, then place a one at the index that corresponds to the word.
  • One-hot encodings capture no relationship between words: every pair of distinct one-hot vectors is orthogonal.
  • Word embeddings - an embedding is a dense vector of floating-point values; words with similar meanings have similar vectors.
  • The basic idea for training is that words occurring in similar contexts have similar meanings (see the sketch below).
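A minimal sketch of the contrast, using NumPy for the one-hot vectors and gensim's Word2Vec as one common way to train dense embeddings; the toy vocabulary and corpus are illustrative assumptions:

```python
# Minimal sketch: one-hot vectors vs. dense word embeddings.
import numpy as np
from gensim.models import Word2Vec

vocab = ["king", "queen", "apple"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # zero vector with vocabulary length, a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Distinct one-hot vectors are always orthogonal,
# so no similarity between related words is captured.
print(np.dot(one_hot("king"), one_hot("queen")))  # 0.0

# Dense embeddings trained from context: words appearing in similar
# contexts end up with similar vectors.
corpus = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "land"],
    ["i", "ate", "an", "apple"],
]
model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, epochs=50)
print(model.wv.similarity("king", "queen"))  # typically higher than...
print(model.wv.similarity("king", "apple"))  # ...this, on a real corpus
```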

Ref - Link

Keep Exploring!!!
