By default, NLTK (Natural Language Toolkit) includes a built-in list of English stop words, including: “a”, “an”, “the”, “of”, “in”, etc. (List Link)
- Term Frequency: the number of times a word appears in a document, divided by the total number of words in that document.
- Document Frequency: the number of documents in which a word appears, out of all documents in the corpus.
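The two definitions above can be sketched in plain Python (the tiny corpus here is made up for illustration; it is not from the post):

```python
# Term frequency and document frequency computed by hand for a toy corpus.
docs = [
    "the food was good",
    "the food was bad",
    "good food good mood",
]

def term_frequency(word, doc):
    # times the word appears in the document / total words in the document
    words = doc.split()
    return words.count(word) / len(words)

def document_frequency(word, docs):
    # number of documents that contain the word at least once
    return sum(1 for doc in docs if word in doc.split())

print(term_frequency("good", docs[2]))   # 2 occurrences / 4 words = 0.5
print(document_frequency("good", docs))  # "good" appears in 2 documents
```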
BOW (Bag of Words)
- BOW represents a sentence as a vector of word counts
- BOW does not preserve word order
- BOW fails when word order changes the meaning, e.g.:
- Food was good, not bad at all
- Food was bad, not good at all
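A minimal pure-Python sketch (not from the original post) makes the failure concrete: the two sentences above contain exactly the same words, so their bag-of-words representations are identical even though their meanings are opposite.

```python
from collections import Counter

s1 = "food was good not bad at all"
s2 = "food was bad not good at all"

# A Counter of word frequencies is effectively a bag-of-words vector.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # True: BOW cannot distinguish the two sentences
```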
spaCy is the library's full name, spelled with a lowercase “s” and a capital “C”.
One-hot encodings vs Word embeddings
- One-hot encodings - To represent each word, create a zero vector with length equal to the vocabulary size, then place a one at the index that corresponds to the word.
- One-hot encodings do not capture any relationship between words: every pair of one-hot vectors is equally distant.
- Word embeddings - An embedding is a dense vector of floating point values. Words with similar meanings have similar vectors
- The basic idea for training is that words occurring in similar contexts have similar meanings.
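The contrast above can be sketched in plain Python. The vocabulary and the 3-dimensional embedding values below are made up purely for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
# One-hot: zero vector of vocabulary length with a single 1.
vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    vec = [0] * len(vocab)      # zero vector, length = vocabulary size
    vec[vocab.index(word)] = 1  # place a one at the word's index
    return vec

print(one_hot("dog", vocab))    # [0, 1, 0]

# Toy dense embeddings: "cat" and "dog" (similar contexts) point in
# similar directions, while "car" does not.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine(a, b):
    # cosine similarity: dot product divided by the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

print(cosine(embeddings["cat"], embeddings["dog"]))  # close to 1
print(cosine(embeddings["cat"], embeddings["car"]))  # much smaller
```

With one-hot vectors the cosine similarity between any two distinct words is always 0, which is exactly the "relationship is not captured" problem; dense embeddings fix this by letting similar words share direction in the vector space.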
Keep Exploring!!!