"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

April 11, 2023

Segment Anything Model (SAM) - Meta AI (Facebook) Offering

1. Google Colab GPU Version

2. Sample Code


3. Sample Image

4. Sample results (segmentation time: 30 seconds)
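For step 2 above, here is a minimal sketch of what a point-prompted run might look like, assuming the official `segment_anything` package is installed and a checkpoint file has been downloaded. The helper names `make_point_prompt` and `segment_with_point`, the `vit_h` model type, and the checkpoint filename are my assumptions for illustration; this is not necessarily the code used for the results in this post.

```python
import numpy as np


def make_point_prompt(x, y, foreground=True):
    """Build the (coords, labels) arrays for a single click prompt (hypothetical helper)."""
    coords = np.array([[x, y]], dtype=np.float32)   # shape (1, 2), pixel (x, y)
    labels = np.array([1 if foreground else 0])     # 1 = foreground, 0 = background
    return coords, labels


def segment_with_point(image_rgb, x, y,
                       checkpoint="sam_vit_h_4b8939.pth",  # assumed local checkpoint path
                       device="cuda"):
    """Run SAM on an HxWx3 uint8 RGB image with one foreground click (hypothetical helper)."""
    # Imported lazily so make_point_prompt works even without the package installed.
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    sam.to(device=device)

    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)        # the image embedding is computed once here

    coords, labels = make_point_prompt(x, y)
    masks, scores, _ = predictor.predict(
        point_coords=coords,
        point_labels=labels,
        multimask_output=True,            # return several candidate masks
    )
    return masks, scores                  # masks: (C, H, W) bool, scores: (C,)
```

On a Colab GPU runtime, something like `masks, scores = segment_with_point(image, 500, 375)` would return the candidate masks for a click at pixel (500, 375); the coordinates here are placeholders.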

Key components - Image encoder, prompt encoder, and mask decoder.

  • The image encoder is a pre-trained Masked Auto-Encoder Vision Transformer (MAE-ViT) that extracts an embedding for the image.
  • The prompt encoder embeds prompts of different types, including points, bounding boxes, free-form text, or rough masks.
  • The mask decoder has layers that use self-attention, cross-attention, and an MLP to update the image embedding with information from the prompt. Another MLP then uses this more informative embedding to produce the final mask. The decoder also predicts an IoU score for each mask, which serves as a confidence estimate for ranking candidate masks.
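To make the IoU estimate more concrete, here is a small NumPy sketch of the quantity being estimated (intersection over union between two binary masks) and of one way the scores can be used: picking the highest-scoring candidate. The function names `mask_iou` and `pick_best_mask` are illustrative assumptions, not part of the SAM code base, and the toy masks are made up.

```python
import numpy as np


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks with the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0                        # both masks empty: define IoU as 0
    return float(np.logical_and(a, b).sum()) / float(union)


def pick_best_mask(masks, scores):
    """Return (index, mask) of the candidate with the highest score."""
    idx = int(np.argmax(scores))
    return idx, masks[idx]


# Toy example: two 2x2 masks overlapping in one pixel out of two covered pixels.
predicted = np.array([[1, 1], [0, 0]])
reference = np.array([[1, 0], [0, 0]])
print(mask_iou(predicted, reference))     # 0.5
```

At inference time there is no reference mask, so the model's predicted IoU stands in for the true value; `pick_best_mask(masks, scores)` simply selects the candidate the model is most confident about.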

Ref - Link
Paper - Link
Demo - Link


Keep Exploring!!!
