"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

May 16, 2023

Encoder / Decoder Discussions

GPT

  • GPT-2 is built using transformer decoder blocks
  • GPT generates one token at a time, just like the decoder of the original transformer, and is trained with causal language modeling, so it is strictly a decoder-only model.
  • GPT-2 does not need the encoder part of the original transformer architecture since it is decoder-only; its blocks also drop the encoder-decoder (cross-) attention sub-layer, so each block resembles an encoder block apart from the causal mask.
  • The model outputs the next word, but it is auto-regressive: each token is predicted using the context of all previous tokens (see the sketch after this list).
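
As a rough illustration (assuming the Hugging Face transformers library and a downloaded GPT-2 checkpoint; not code from the post), the sketch below generates text one token at a time. The causal language modeling setup means each step only conditions on the tokens produced so far.

```python
# Minimal sketch of decoder-only, auto-regressive generation with GPT-2,
# assuming the Hugging Face `transformers` library is installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The transformer decoder"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Auto-regressive loop: each new token is predicted only from the
# tokens generated so far (causal language modeling).
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits          # shape: (1, seq_len, vocab)
    next_id = logits[0, -1].argmax()              # greedy choice of the next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```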

BERT

  • BERT, on the other hand, uses transformer encoder blocks
  • BERT can incorporate context from both sides of a word, which yields better representations
  • BERT outputs the same number of token representations as it receives as input, and these can be fed to a linear layer; it is trained with masked language modeling, so it is strictly an encoder-only model.
  • BERT, by contrast, is not auto-regressive: it uses the entire surrounding context all at once (see the sketch after this list).
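
As a rough illustration (again assuming the Hugging Face transformers library and a downloaded BERT checkpoint; not code from the post), the sketch below fills in a masked token. Because BERT is an encoder, it sees the whole sentence at once and the prediction for [MASK] uses context from both the left and the right.

```python
# Minimal sketch of encoder-only masked language modeling with BERT,
# assuming the Hugging Face `transformers` library is installed.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits               # one output vector per input token

# The whole sentence is encoded at once, so the [MASK] prediction
# is conditioned on both left and right context.
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```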

Decoder (in the full encoder-decoder transformer) - its cross-attention layers let it pay attention to specific segments of the encoder's output
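
A toy sketch of that cross-attention (my own illustration with made-up tensor sizes, not from the post): the queries come from the decoder, while the keys and values come from the encoder output, so each decoder position can focus on particular encoder segments.

```python
# Toy sketch of decoder cross-attention over encoder outputs using PyTorch.
import torch
import torch.nn as nn

d_model = 64
encoder_out = torch.randn(1, 12, d_model)     # 12 encoder positions
decoder_hidden = torch.randn(1, 5, d_model)   # 5 decoder positions

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

# Queries from the decoder; keys and values from the encoder, so each
# decoder position attends to specific segments of the encoder output.
out, weights = cross_attn(query=decoder_hidden, key=encoder_out, value=encoder_out)
print(out.shape)      # (1, 5, 64)  - one attended vector per decoder position
print(weights.shape)  # (1, 5, 12)  - attention over the 12 encoder positions
```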

Ref - Link1, Link2

Keep Exploring!!!
