Bag of Words
- Create a new column for each unique word in the data
- Count the occurrences of each word in each document
- sklearn.feature_extraction.text.CountVectorizer
- Rows become more comparable across documents after normalizing by Term Frequency (TF)
- tf = 1 / x.sum(axis=1)[:,None]
- x = x * tf
- Inverse Document Frequency (IDF) down-weights words that appear in many documents
- idf = np.log(x.shape[0] / (x > 0).sum(0))
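A minimal end-to-end sketch of the above (the toy corpus is made up for illustration): build the count matrix with CountVectorizer, then apply the TF and IDF steps in NumPy. Note that scikit-learn's TfidfTransformer implements a smoothed variant of the same idea.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one document
corpus = ["the dog is on the table",
          "the cat is on the mat"]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(corpus).toarray()  # rows = documents, columns = unique words
print(vectorizer.get_feature_names_out())       # the vocabulary (one column per word)

# Term Frequency: normalize each row by the document's total word count
tf = 1 / x.sum(axis=1)[:, None]
x_tf = x * tf

# Inverse Document Frequency: log(#documents / #documents containing the word)
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))

print(x_tf * idf)  # TF-IDF weighted matrix
```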
- N-grams
- Extend Bag of Words: each row still represents a text, but columns now represent sequences of N consecutive words (n-grams) instead of single words
- Useful when classifying documents, since n-grams capture some local word order
For N = 1, "This is a sentence"
Unigrams: This, is, a, sentence
For N = 2, "This is a sentence"
Bigrams: This is, is a, a sentence
For N = 3, "This is a sentence"
Trigrams: This is a, is a sentence
sklearn.feature_extraction.text.CountVectorizer: ngram_range and analyzer parameters
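For example (a small sketch; the token_pattern override is only there to keep one-letter tokens like "a", which the default pattern drops):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts both unigrams and bigrams; analyzer='word'
# (the default) builds n-grams over words rather than characters
vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer="word",
                             token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(["This is a sentence"])
print(vectorizer.get_feature_names_out())
# ['a' 'a sentence' 'is' 'is a' 'sentence' 'this' 'this is']
```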
Text Preprocessing steps
- Lower case
- Lemmatization (using knowledge of vocabulary and morphological analysis of words)
- democracy, democratic and democratization -> democracy (Lemmatization)
- Stemming (chops off the endings of words)
- democracy, democratic, and democratization -> democr (Stemming)
- Stop words (words that do not carry important information, e.g. articles and prepositions)
sklearn.feature_extraction.text.CountVectorizer: the max_df parameter can filter out overly frequent words such as stop words (there is also a dedicated stop_words parameter)
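A sketch of these preprocessing steps using NLTK (this assumes NLTK is installed and its stopwords and wordnet data have been downloaded; the example text is made up):

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Democracy and democratization are democratic processes"

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = text.lower().split()                        # lowercasing + naive tokenization
tokens = [t for t in tokens if t not in stop_words]  # stop-word removal

print([stemmer.stem(t) for t in tokens])             # stemming: chops word endings
print([lemmatizer.lemmatize(t) for t in tokens])     # lemmatization: dictionary base forms
```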
I have applied all of these steps in my assignment work; the code is in my GitHub repository.
Applying Bag of Words
- Preprocessing: lowercasing, stemming, lemmatization, stop-word removal
- N-grams can help capture local context
- Postprocessing: TF-IDF
- Apply BOW over the n-grams (a pipeline sketch follows this list)
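Here is one way these steps could be chained in scikit-learn; stemming and lemmatization would need a custom preprocessor or tokenizer, so this sketch covers only the steps CountVectorizer supports out of the box:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

bow_pipeline = Pipeline([
    ("bow", CountVectorizer(lowercase=True,        # preprocessing: lowercasing
                            stop_words="english",  # preprocessing: stop-word removal
                            ngram_range=(1, 2))),  # BOW over unigrams and bigrams
    ("tfidf", TfidfTransformer()),                 # postprocessing: TF-IDF weighting
])

docs = ["The dog is on the table", "The cat sat on the mat"]
features = bow_pipeline.fit_transform(docs)
print(features.shape)  # (number of documents, number of n-gram columns)
```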
BOW example
- Sentence: "The dog is on the table"
- Vocabulary: are, cat, dog, is, now, on, the, table
- Binary BOW representation (presence/absence): 0, 0, 1, 1, 0, 1, 1, 1 (with raw counts, "the" would appear as 2)
BOW Issue
The food was good, not bad at all
The food was bad, not good at all
Both sentences have exactly the same BOW representation, yet their meanings are opposite :)
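This is easy to verify (a quick sketch using scikit-learn's defaults):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food was good, not bad at all",
        "The food was bad, not good at all"]

x = CountVectorizer().fit_transform(docs).toarray()
print(np.array_equal(x[0], x[1]))  # True: identical BOW vectors, opposite meanings
```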
Word Vectors (Word2Vec)
- Get vector representation of words and texts
- Each word is converted to a vector
- Training uses nearby (context) words
- Different words used in the same context end up with similar vector representations
- Basic arithmetic operations can be applied to the vectors (e.g., king - man + woman ≈ queen)
- Words: Word2Vec, GloVe, FastText
- Sentences - Doc2Vec
- There are pretrained models
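For example, pretrained GloVe vectors can be loaded through gensim's downloader (this assumes gensim is installed; the model is downloaded on first use):

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
model = api.load("glove-wiki-gigaword-50")

print(model["dog"].shape)         # each word is a 50-dimensional vector
print(model.most_similar("dog"))  # words used in similar contexts sit nearby

# Basic vector arithmetic: king - man + woman ~ queen
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```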
Bag of Words
- Very large, sparse vectors
- The meaning of each value in the vector is known (it is the count of a specific word)
Word2Vec
- Relatively small vectors
- Values of the vector can be interpreted only in some cases
- The words with similar meaning often have similar embeddings
Happy Learning, Happy Coding!!!