"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

March 01, 2021

Back to Basics - Fundamentals - RNN - Transformers

It takes a bit more careful *attention* to understand the crux of transformers. This lecture was useful.

Slides - Link

Session - 

Transfer Learning

  • Use a neural network pretrained on ImageNet and fine-tune it on custom data (sketch below)
  • Better performance than training from scratch
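
A minimal sketch of the idea, assuming PyTorch/torchvision are available (the 10-class head is a made-up example): load an ImageNet-pretrained ResNet-18, freeze the backbone, and replace the final layer before fine-tuning on the custom data.

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(pretrained=True)        # ImageNet weights
    for p in model.parameters():
        p.requires_grad = False                     # freeze the pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, 10)  # new head for a hypothetical 10-class dataset
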
Convert words to vectors

  • One-hot encoding
  • Scales poorly with vocabulary size
  • Sparse and high dimensional
  • Map one-hot vectors to dense vectors (embedding matrix) - see the sketch below
  • Finding the embedding matrix - learn it as part of the task
  • Learn a language model
  • Train on a large corpus of text - e.g. Wikipedia
  • N-grams, sliding windows forming the rows
  • Binary classification - 0/1 - neighbouring word or not
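
A toy sketch of the one-hot-to-dense mapping (vocabulary and dimensions are made up for illustration): multiplying a one-hot vector by the embedding matrix E is just a row lookup.

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat"]
    V, d = len(vocab), 3
    E = np.random.randn(V, d)            # embedding matrix, learned as part of the task

    word_id = vocab.index("cat")
    one_hot = np.zeros(V)
    one_hot[word_id] = 1.0
    assert np.allclose(one_hot @ E, E[word_id])   # one-hot times E == row lookup
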
NLP ImageNet moment - ELMo / ULMFiT

  • ELMo - bidirectional stacked LSTM
  • ULMFiT

Good Paper Read - SQuAD: 100,000+ Questions for Machine Comprehension of Text






Attention

  • Only attention, no LSTM
  • Self-attention, positional encoding, layer normalization
  • Attention and fully connected layers

Self Attention

  • Input: a sequence of vectors
  • Output: a weighted sum of the input sequence for each position

Learn weights

  • Each input vector is used in three ways (see the sketch below):
  • Compared to every other vector to compute the attention weights for its own output y_i (query)
  • Compared against by every other vector to compute the attention weight w_ij for output y_j (key)
  • Summed with the other vectors to form the result of the attention-weighted sum (value)
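
A minimal numpy sketch of basic self-attention, before any weights are learned: each input vector serves as query, key, and value, and the attention weights come from softmaxed dot products.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    x = np.random.randn(5, 4)      # sequence of 5 input vectors, dimension 4
    w = softmax(x @ x.T)           # w[i, j]: how much position i attends to position j
    y = w @ x                      # each y_i is a weighted sum of all input vectors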

Multihead attention

  • Weight matrices - query, key, value weights
  • Multiple heads of attention just mean learning different sets of query, key and value matrices simultaneously (sketch below)
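
A sketch of scaled dot-product attention with learned query/key/value weight matrices; multiple heads just repeat this with separate matrices and concatenate the outputs (all sizes here are illustrative).

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d, heads, seq = 8, 2, 5
    x = np.random.randn(seq, d)
    outputs = []
    for _ in range(heads):                           # one set of weight matrices per head
        Wq, Wk, Wv = (np.random.randn(d, d // heads) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        w = softmax(q @ k.T / np.sqrt(d // heads))   # scaled dot-product attention
        outputs.append(w @ v)
    y = np.concatenate(outputs, axis=-1)             # concatenated head outputs, shape (seq, d)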

Transformer

  • Self-attention layer - layer normalization - dense (feed-forward) layer, as sketched below
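
A compact PyTorch sketch of one such block, assuming the standard design with residual connections around both sub-layers (sizes are illustrative):

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, d=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d)
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
            self.norm2 = nn.LayerNorm(d)

        def forward(self, x):                  # x: (batch, seq, d)
            a, _ = self.attn(x, x, x)          # self-attention over the sequence
            x = self.norm1(x + a)              # residual connection + layer norm
            return self.norm2(x + self.ff(x))  # residual connection + layer norm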



Layer Normalization

  • Data scaling, weight initialization
  • Rescales activations to zero mean and unit standard deviation - sketch below
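
A minimal numpy sketch of layer normalization, without the learned gain and bias parameters:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # normalize each row to zero mean and unit standard deviation across its features
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    x = np.random.randn(5, 8) * 3 + 2
    print(layer_norm(x).mean(axis=-1))   # ~0 per row
    print(layer_norm(x).std(axis=-1))    # ~1 per row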

Position Embedding

  • Word embedding depends on the word
  • Position embedding depends on the position
  • Combine both and run the result through the transformer (sketch below)
  • The model can then reason about both position and content
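
A toy sketch of combining the two (learned position embeddings are assumed; all sizes are illustrative):

    import numpy as np

    V, max_len, d = 1000, 128, 64
    word_emb = np.random.randn(V, d)        # depends on the word
    pos_emb = np.random.randn(max_len, d)   # depends on the position

    token_ids = np.array([5, 42, 7])
    x = word_emb[token_ids] + pos_emb[np.arange(len(token_ids))]   # input to the transformer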

Attention is all you need

  • Translation
  • Encoder - Decoder architecture

GPT - Generative pretrained transformer

  • Generating text
  • ELMo, ULMFiT
  • Predicts the next word from the preceding words
  • GPT-2: 1.5 billion parameters

BERT

  • Bidirectional Encoder Representations from Transformers

T5 - Text to Text Transfer Transformer

  • Input and output as text streams
  • 11 billion parameters

Keep Thinking!!!
