GPT
- GPT-2 is built from transformer decoder blocks
- GPT generates one token at a time, just like the decoder of the original transformer, and is trained with causal language modeling, so it is strictly a decoder-only model (see the sketch after this list)
- Because GPT-2 is decoder-only, it does not need the encoder part of the original transformer architecture; with no encoder-decoder (cross) attention blocks, its decoder block is essentially an encoder block with a causal mask
- GPT outputs the next word, and it is auto-regressive: each generated token is conditioned on the context of all the previous tokens
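Below is a minimal sketch of this auto-regressive, next-token behaviour. It assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint, which are not part of the original post, just one convenient way to try the idea.

```python
# Causal (auto-regressive) generation sketch: each new token is predicted
# from the tokens to its left only.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The transformer decoder predicts", return_tensors="pt")

# Tokens are generated one at a time; every step conditions only on the
# previously generated context (causal language modeling).
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```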
BERT
- BERT, on the other hand, uses transformer encoder blocks
- BERT can incorporate context from both sides of a word, which gives better representations for many downstream tasks
- BERT produces one output vector per input token (the same number as the input), which can be fed to a linear layer; it is trained with masked language modeling, so it is strictly an encoder-only model (see the sketch after this list)
- BERT, by contrast, is not auto-regressive: it uses the entire surrounding context all at once
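Here is a minimal masked language modeling sketch showing that bidirectional, all-at-once behaviour. It assumes the Hugging Face `transformers` library and the public "bert-base-uncased" checkpoint; the example sentence is purely illustrative.

```python
# Masked LM sketch: BERT sees the whole sentence at once and predicts the
# [MASK] token using context from both the left and the right.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logits vector per input token

# Find the position of [MASK] and take the highest-scoring vocabulary item.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([predicted_id]))
```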
Decoder (in the original encoder-decoder transformer) - its cross-attention layers pay attention to specific segments of the encoder output
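A small PyTorch sketch of that cross-attention step follows; the dimensions and tensor names are illustrative assumptions, not taken from the post.

```python
# Cross-attention sketch: queries come from the decoder, while keys and
# values come from the encoder output, so each decoder position can attend
# to specific segments of the source sequence.
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_out = torch.randn(1, 20, d_model)    # encoder output ("memory")
decoder_states = torch.randn(1, 5, d_model)  # current decoder hidden states

attended, weights = cross_attn(query=decoder_states,
                               key=encoder_out,
                               value=encoder_out)
print(attended.shape)  # torch.Size([1, 5, 512])
print(weights.shape)   # torch.Size([1, 5, 20]) - attention over source tokens
```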
Keep Exploring!!!