Ref - Post
- Product region, brand logo region
- Product textual data (title, brands)
- Regions of interest in the images are detected by a pre-trained teacher model
- Following the trend of using free-form text, we train the CPG model with 2.3M product entities synthesized from an e-commerce site in a self-supervised fashion
- The bounding boxes for the product-noun-to-object task are generated by a pre-trained, general-domain modulated detection model
- Visual-language understanding of logos, brand strings, product details for the query product entity and for all brand representative product entities
- Text to image lookup and comparison
- Similar embedding lookup and comparison
- The crafted image caption is tokenized and encoded with a pre-trained text encoder, RoBERTa
- Image and textual features are concatenated as a multimodal vector and fed to a joint transformer encoder with cross attention between image and textual features
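
The "similar embedding lookup and comparison" step can be illustrated with a small sketch. This is not the method from the referenced post: it assumes embeddings are plain lists of floats, that cosine similarity is the comparison metric, and that each brand has a single representative embedding. The names `cosine_similarity`, `nearest_brand`, `brand_a`, and `brand_b` are hypothetical and exist only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_brand(query_embedding, brand_embeddings):
    """Return (brand, score) for the reference embedding closest to the query."""
    if not brand_embeddings:
        raise ValueError("brand_embeddings must not be empty")
    scored = (
        (brand, cosine_similarity(query_embedding, embedding))
        for brand, embedding in brand_embeddings.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Toy 3-dimensional embeddings (assumed values, not taken from any real model).
brand_embeddings = {
    "brand_a": [1.0, 0.0, 0.0],
    "brand_b": [0.0, 1.0, 0.0],
}
query = [0.9, 0.1, 0.0]

best_brand, best_score = nearest_brand(query, brand_embeddings)
print(best_brand, round(best_score, 3))
```

In practice the query and reference vectors would come from the joint encoder rather than being written by hand, and a real system would likely use an approximate nearest-neighbour index instead of a linear scan.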
Keep Exploring!!!