1. Google Colab GPU Version
2. Sample Code
3. Sample Image
4. Sample results, Segmentation Time = 30 seconds
Key components: image encoder, prompt encoder, and mask decoder.
- The image encoder is a Vision Transformer (ViT) pre-trained as a Masked Autoencoder (MAE); it extracts an embedding for the image.
- The prompt encoder embeds prompts of different types, including points, bounding boxes, free-form text, or rough masks.
- The mask decoder stacks layers that combine self-attention, cross-attention, and an MLP. These layers produce a more informative image embedding, which a separate MLP then uses to produce the final mask. The model also estimates the IoU (Intersection over Union) of each predicted mask for later use in the process.
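To make the IoU estimate above more concrete, here is a minimal pure-Python sketch of what IoU measures and of one plausible way a predicted IoU score could be used: ranking several candidate masks and keeping the best one. The post does not say exactly how the estimate is used later, so the ranking step is an assumption, and the names `iou` and `pick_best_mask` are hypothetical helpers written for this illustration, not part of any library.

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def iou(a: Mask, b: Mask) -> float:
    """Intersection over Union of two same-shaped binary masks."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    # Pixels set in both masks, and pixels set in at least one mask.
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def pick_best_mask(masks: Sequence[Mask], predicted_iou: Sequence[float]) -> Mask:
    """Return the candidate with the highest predicted IoU score.

    Assumption for illustration: the model's IoU estimate is used to
    rank candidate masks. The scores here are supplied by the caller.
    """
    if not masks or len(masks) != len(predicted_iou):
        raise ValueError("need one score per candidate mask")
    best = max(range(len(masks)), key=lambda i: predicted_iou[i])
    return masks[best]


if __name__ == "__main__":
    truth = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    tight = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]  # misses one object pixel
    loose = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # covers the whole image
    print(iou(truth, tight))  # 3 shared pixels / 4 in the union = 0.75
    print(iou(truth, loose))  # 4 shared pixels / 9 in the union
    # Made-up scores standing in for the model's own IoU estimates.
    print(pick_best_mask([tight, loose], [0.9, 0.4]) is tight)
```

At inference time there is no ground-truth mask to compare against, which is why a model-predicted IoU is useful: it acts as a confidence score for each mask.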
Keep Exploring!!!