It takes a bit more careful *attention* to understand the crux of transformers. This lecture was useful.
Transfer Learning
- Take a neural network pretrained on ImageNet and fine-tune it on custom data (see the sketch below)
- Typically gives better performance than training from scratch
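A minimal sketch of this idea, assuming PyTorch and torchvision are available; the model choice and the 10-class head are just illustrative:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pretrained on ImageNet (assumes torchvision >= 0.13 weight names).
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained backbone so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classifier with one sized for a hypothetical 10-class custom dataset.
model.fc = nn.Linear(model.fc.in_features, 10)
```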
Convert words to vectors
- One-hot encoding
- Scales poorly with vocabulary size
- Sparse and high dimensional
- Map one-hot vectors to dense vectors (embedding matrix)
- Finding the embedding matrix - learn it as part of the task (see the sketch after this list)
- Learn a language model
- Train on a large corpus of text, e.g. Wikipedia
- N-grams: a sliding window over the text forms the training rows
- Binary classification - 0 / 1 - neighbouring word or not (see the sketch after this list)
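A rough sketch tying these two lists together, assuming PyTorch: an embedding matrix learned by classifying whether two words drawn from a sliding window are neighbours. The window size, embedding dimension, and the random "negative" word are assumptions, not taken from the lecture:

```python
import random
import torch
import torch.nn as nn

text = "the quick brown fox jumps over the lazy dog".split()
vocab = {w: i for i, w in enumerate(set(text))}
window = 2                                       # assumed context window size

# Sliding window -> (centre word, other word, neighbour-or-not) rows.
rows = []
for i, centre in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            rows.append((vocab[centre], vocab[text[j]], 1.0))        # real neighbour
    rows.append((vocab[centre], random.randrange(len(vocab)), 0.0))  # random "negative"

# The embedding matrix is learned as part of this binary classification task.
embed = nn.Embedding(len(vocab), 16)
optimizer = torch.optim.Adam(embed.parameters(), lr=0.01)

for a, b, label in rows:
    # Dot-product similarity between the two word vectors acts as the classifier score.
    score = (embed(torch.tensor(a)) * embed(torch.tensor(b))).sum()
    loss = nn.functional.binary_cross_entropy_with_logits(score, torch.tensor(label))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```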
NLP's ImageNet moment - ELMo / ULMFiT
- ELMo - bidirectional stacked LSTMs
- ULMFiT
Good Paper Read - SQuAD: 100,000+ Questions for Machine Comprehension of Text
Attention
- Attention only, no LSTMs
- Self-attention, positional encoding, Layer normalization
- Attention and Fully Connected Layers
Self Attention
- Input sequence of vectors
- Output weighted sum of input sequence
Computing the weights - each input vector plays three roles (see the sketch after this list)
- Query: compared to every other vector to compute the attention weights for its own output y_i
- Key: compared to every other vector to compute the attention weights w_ij for the other outputs y_j
- Value: summed, weighted by the attention weights, to form each output vector
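A minimal sketch of single-head self-attention in PyTorch; the dimensions and the scaled dot-product form are standard assumptions, not taken from the slides:

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 16                  # illustrative sequence length and vector size
x = torch.randn(seq_len, d)         # input sequence of vectors

# Learned projections turning each input vector into a query, key, and value.
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

q, k, v = x @ W_q, x @ W_k, x @ W_v

# w_ij: how much input j contributes to output i (scaled dot product, softmax over j).
weights = F.softmax(q @ k.T / d ** 0.5, dim=-1)   # shape: (seq_len, seq_len)

# Each output y_i is a weighted sum of the value vectors.
y = weights @ v                                    # shape: (seq_len, d)
```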
Multihead attention
- Weight matrices for queries, keys, and values
- Multiple heads of attention just means learning several sets of query, key, and value matrices simultaneously (see the sketch below)
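One way to see this in code, assuming PyTorch's built-in nn.MultiheadAttention; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 16, 4, 5      # illustrative sizes
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)        # (batch, sequence, embedding)

# Self-attention: the same sequence supplies queries, keys, and values.
# Internally each of the 4 heads learns its own query/key/value projections.
out, weights = attn(x, x, x)
print(out.shape)                              # torch.Size([1, 5, 16])
```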
Transformer
- Self-attention layer - layer normalization - dense layer (see the sketch below)
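A stripped-down transformer block following that recipe, assuming PyTorch; the residual connections are my assumption, since they are standard in the architecture:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim=16, num_heads=4, ff_dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, embed_dim)
        )

    def forward(self, x):
        # Self-attention, then layer normalization (with a residual connection).
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Dense (feed-forward) layer, then layer normalization again.
        return self.norm2(x + self.ff(x))

block = TransformerBlock()
print(block(torch.randn(1, 5, 16)).shape)     # torch.Size([1, 5, 16])
```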
Layer Normalization
- Same idea as input data scaling and careful weight initialization, applied inside the network
- Rescales activations between layers to zero mean and unit standard deviation (see the sketch below)
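A quick numerical sketch of what layer normalization does to one vector of activations, using PyTorch's nn.LayerNorm with its default learned scale and shift:

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])      # activations for one example
ln = nn.LayerNorm(4)                          # normalizes across the feature dimension

y = ln(x)
print(y.mean(dim=-1), y.std(dim=-1, unbiased=False))   # ~0 mean, ~1 std per example
```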
Position Embedding
- The word embedding depends only on the word
- The position embedding depends only on the position
- Combine the two and run the result through the transformer (see the sketch after this list)
- So the model can reason about both position and content
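A sketch of combining the two, assuming learned position embeddings and PyTorch; the vocabulary size, maximum length, and embedding dimension are made up:

```python
import torch
import torch.nn as nn

vocab_size, max_len, embed_dim = 10_000, 512, 16      # illustrative sizes
word_emb = nn.Embedding(vocab_size, embed_dim)        # depends only on the word
pos_emb = nn.Embedding(max_len, embed_dim)            # depends only on the position

word_ids = torch.tensor([[42, 7, 999]])               # one sequence of three tokens
positions = torch.arange(word_ids.shape[1]).unsqueeze(0)

# Add the two embeddings so downstream layers see both content and position.
x = word_emb(word_ids) + pos_emb(positions)           # shape: (1, 3, 16)
```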
Attention is all you need
- Translation
- Encoder - Decoder architecture
GPT - Generative pretrained transformer
- Generating text
- Same language-model pretraining idea as ELMo and ULMFiT
- Predicts each word from the preceding words (see the sketch below)
- GPT-2: 1.5 billion parameters
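A minimal sketch of next-word generation with the released GPT-2 weights, assuming the Hugging Face transformers library is installed; the prompt and generation settings are just an example:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The model repeatedly predicts the next token from all preceding tokens.
inputs = tokenizer("Transformers are", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0]))
```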
BERT
- Bidirectional Encoder Representations from Transformers
T5 - Text to Text Transfer Transformer
- Input and output as text streams
- 11 billion parameters
Keep Thinking!!!