Transformer - Let's relearn
These topics come and go. I was able to catch up on sliding windows, CNNs, RNNs, and LSTMs, then a bit of Transformers and how they work in vision too :)
AI / ML won't let you skip the fundamentals; you still have to learn the basics.
Paper - Attention Is All You Need
Key Lessons
- Representation of the sequence
- Self-attention (intra-attention) relating different positions of a single sequence
- Encoder-decoder structure
- Encoder - maps the input sequence to a sequence of continuous representations
- The decoder then generates an output sequence one element at a time; positional encodings inject information about token order
- Multi-Head Attention consists of several attention layers running in parallel (see the sketch after this list)
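To make these concrete, here is a minimal NumPy sketch of sinusoidal positional encoding, scaled dot-product attention, and the multi-head split described in the paper. The weight matrices (W_q, W_k, W_v, W_o) and toy sizes are made up for illustration; a real implementation would use learned parameters, masking, and the layer norm / feed-forward blocks around this.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, len_q, len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                                        # (batch, len_q, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project to queries/keys/values, split d_model into parallel heads,
    attend in each head, then concatenate and project back with W_o."""
    batch, seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(T):
        # (batch, seq, d_model) -> (batch * heads, seq, d_head)
        T = T.reshape(batch, seq_len, num_heads, d_head)
        return T.transpose(0, 2, 1, 3).reshape(batch * num_heads, seq_len, d_head)

    heads = scaled_dot_product_attention(split_heads(X @ W_q),
                                         split_heads(X @ W_k),
                                         split_heads(X @ W_v))
    heads = heads.reshape(batch, num_heads, seq_len, d_head)
    heads = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return heads @ W_o

# Toy usage: batch of 2 sequences, length 5, d_model 8, 2 heads (sizes are hypothetical).
X = np.random.rand(2, 5, 8) + sinusoidal_positional_encoding(5, 8)
W_q = W_k = W_v = W_o = np.random.rand(8, 8) * 0.1
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)   # (2, 5, 8)
```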
Unofficial Walkthrough of Vision Transformer
- An image is also just pixels; once pixel (patch) representations are learned, the same encoding / decoding machinery can be applied.
Transformers for Image Recognition at Scale
Key Notes
- Input image as a sequence of image patches, similar to the sequence of word embeddings
- The Vision Transformer treats an input image as a sequence of patches
- ViT can learn features hard-coded into CNNs (such as awareness of grid structure)
- Image classification with Vision Transformer
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Split an image into fixed-size patches
- Linearly embed each of them
- Add position embeddings
- Feed the resulting sequence of vectors to a standard Transformer encoder (see the sketch below)
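As a rough illustration of those four steps, here is a NumPy sketch of splitting an image into 16x16 patches and projecting them into a token sequence with position embeddings. The projection matrix, position embeddings, and image size are hypothetical, and ViT's learnable [class] token is omitted for brevity.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches and flatten each,
    returning a (num_patches, patch_size * patch_size * C) sequence."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def embed_patches(image, W_proj, pos_embed, patch_size=16):
    """Linearly embed each patch and add position embeddings, giving the
    token sequence that is fed to a standard Transformer encoder."""
    patches = patchify(image, patch_size)   # (N, p*p*C)
    tokens = patches @ W_proj               # (N, d_model)
    return tokens + pos_embed               # pos_embed: (N, d_model)

# Toy usage: a 224x224 RGB image with 16x16 patches gives 14*14 = 196 tokens.
img = np.random.rand(224, 224, 3)
W_proj = np.random.rand(16 * 16 * 3, 768) * 0.02
pos = np.random.rand(196, 768) * 0.02
tokens = embed_patches(img, W_proj, pos)    # (196, 768)
```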
Do Vision Transformers See Like Convolutional Neural Networks? (tweet by @ak92501, August 20, 2021)
- pdf: https://t.co/5Yz5F2PZwO
- abs: https://t.co/bpHO2rOYDv
- Finds striking differences between the two architectures, such as ViT having more uniform representations across all layers
Do Vision Transformers See Like Convolutional Neural Networks?
- Lower half of ResNet layers are similar to roughly the lowest quarter of ViT layers
- Highest ViT layers are dissimilar to both lower and higher ResNet layers (see the CKA sketch below)
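The paper measures these layer-by-layer similarities with centered kernel alignment (CKA); below is a minimal NumPy sketch of linear CKA between two activation matrices, assuming both were computed on the same examples (one row per example). It shows the formula only, not the paper's actual code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n, d1) and Y (n, d2),
    where rows correspond to the same n examples in both."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

# Toy usage: compare two random "layer activations" for the same 100 inputs.
a = np.random.rand(100, 64)
b = np.random.rand(100, 128)
print(linear_cka(a, b))   # value in [0, 1]; 1 means identical up to rotation/scaling
```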
Keep Thinking!!!