"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 04, 2022

Attention - Lessons

There is a gap between using things as an application developer and knowing how they actually work. I am still trying to figure that out, be it backpropagation, network design, or attention.

Summarizing my lessons from a wonderful thread on the topic:

  • Lesson #1 - The encoder takes the source embeddings and the source mask
  • Lesson #2 - The decoder takes the target embeddings and the target mask (plus the encoder output)
  • Lesson #3 - The encoder is a stack of encoder layers connected in sequence: Encoder1 -> output -> Encoder2 -> ... -> EncoderN
  • Lesson #4 - The output of EncoderN feeds the decoder (via attention over the encoder output in each decoder layer); the decoder likewise has several decoder layers
  • Lesson #5 - Each encoder layer contains a self-attention sublayer followed by a feed-forward sublayer (see the sketches after this list)
  • Lesson #6 - In an RNN/LSTM we carry history through gates (input gate, forget gate, output gate) and the cell state. Here the connection of the sequence back to itself is called self-attention, and it plays a similar role of keeping the sequence history
  • Lesson #7 - Multi-head attention = multiple self-attention heads run in parallel and concatenated
  • Lesson #8 - Self-attention = each position attends over its own sequence (1-2-3, 1-2-3); a weighted percentage of the sequence is picked up as historical context, which influences the next-token prediction
  • Lesson #9/#10/#11 - The forward function of attention = softmax + matrix multiplication (see the first sketch below)
  • Lesson #12/#13/#14 - The decoder has similar attention sublayers: masked multi-head self-attention, plus attention over the encoder output
  • Lesson #15/#16 - Padding masks and positional encoding are applied together with the embedding layer (see the sketch below)
  • Lesson #17/#18 - A linear layer + softmax is applied to the decoder output to produce token probabilities (see the sketch below)
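
To make Lessons #9/#10/#11 concrete, here is a minimal sketch of scaled dot-product attention in PyTorch: a matrix multiplication of queries against keys, a softmax to turn scores into weights, and another matrix multiplication against the values. The function name and tensor shapes are my own illustration, not code from the thread.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """softmax(Q @ K^T / sqrt(d_k)) @ V"""
    d_k = query.size(-1)
    # Matrix multiplication: score every query against every key.
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions (padding or future tokens) are pushed to -inf
        # so that softmax gives them ~zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # Softmax converts scores into weights that sum to 1 over the sequence.
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of values: the "percentage of the sequence" that gets picked up.
    return torch.matmul(weights, value), weights
```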
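
Lesson #7 in the same spirit: multi-head attention is several self-attention heads run in parallel on slices of the model dimension, then concatenated and mixed by one final linear layer. A rough sketch reusing the scaled_dot_product_attention function above; d_model=512 and num_heads=8 are assumed defaults, not values from the thread.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Several self-attention heads in parallel, concatenated back together."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # project to Q, K, V in one go
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape so each head attends over its own slice of d_model.
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        context, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate the heads and mix them with a final linear layer.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out(context)
```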
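
Lessons #3 and #5 together: each encoder layer is a self-attention sublayer followed by a position-wise feed-forward sublayer, and the encoder is simply N of these stacked one after another. The residual connections and layer norms come from the original Transformer paper rather than the thread; the hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Self-attention + feed-forward, each wrapped in residual + layer norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Residual connection around the self-attention sublayer.
        x = self.norm1(x + self.dropout(self.self_attn(x, src_mask)))
        # Residual connection around the feed-forward sublayer.
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x

# Encoder1 -> output -> Encoder2 -> ... -> EncoderN (Lesson #3)
encoder_stack = nn.ModuleList([EncoderLayer() for _ in range(6)])
```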
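
And Lessons #15-#18: sinusoidal positional encoding added on top of the embeddings, a padding mask built from the token ids, and a final linear + softmax over the decoder output. pad_id=0, the vocabulary size, and the tensor shapes are assumptions for illustration only.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding that gets added to the token embeddings."""
    position = torch.arange(seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def padding_mask(token_ids, pad_id=0):
    """1 where there is a real token, 0 where the sequence was padded."""
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)

# Final step (Lessons #17/#18): linear + softmax turns the decoder output
# into a probability distribution over the vocabulary.
vocab_size, d_model = 10000, 512
generator = nn.Linear(d_model, vocab_size)
decoder_output = torch.randn(2, 7, d_model)   # (batch, target_len, d_model)
probs = torch.softmax(generator(decoder_output), dim=-1)
```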

Keep Exploring!!!
