- Automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data
- Most recent approaches focus on text-guided pretraining, i.e., textual supervision is used to guide the training of the features
- Figure 1 visualization: a PCA is computed between the patches of the images from the same column, and its first components are shown (see the sketch after this list)
- Features are learned from images alone
- Self-supervised learning has the potential to learn all-purpose visual features if pretrained on a large quantity of curated data
- Automatic pipeline to filter and rebalance datasets from an extensive collection of uncurated images
- Data similarities are used instead of external metadata, so no manual annotation is required
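
As a rough illustration of the patch-PCA visualization mentioned above, here is a minimal Python sketch. The `torch.hub` entry point and the `forward_features()` output key follow the public facebookresearch/dinov2 repository; the min-max scaling into RGB is my assumption, not the paper's exact plotting code.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA

# Frozen backbone; hub name and forward_features() keys follow the public dinov2 repo.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

def pca_patch_maps(images, n_components=3):
    """images: (B, 3, H, W), square, H and W divisible by the patch size (14)."""
    with torch.no_grad():
        out = model.forward_features(images)
    tokens = out["x_norm_patchtokens"]            # (B, N_patches, D)
    B, N, D = tokens.shape
    flat = tokens.reshape(B * N, D).cpu().numpy()

    # One PCA fitted jointly over the patches of all images in the batch,
    # so the components are aligned across images (the "same column" idea).
    comps = PCA(n_components=n_components).fit_transform(flat)

    # Min-max scale each component to [0, 1] so the first three can be shown as RGB.
    comps = (comps - comps.min(axis=0)) / (np.ptp(comps, axis=0) + 1e-8)
    side = int(N ** 0.5)                          # square patch grid
    return comps.reshape(B, side, side, n_components)
```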
Other Approaches
- Intra-image methods: extract a signal from one part of the image and predict it from the rest of the image
- Discriminative methods: use discriminative signals between images or groups of images to learn features
- Apply the copy-detection pipeline of Pizzi et al. (2022) to the uncurated data to remove near-duplicate images
- Compute an image embedding using a self-supervised ViT-H/16 network pretrained on ImageNet-22k, and use cosine similarity as a distance measure between images (see the first sketch after this list)
- Run k-means clustering on the uncurated data
- Given a query dataset for retrieval: if it is large enough, retrieve N (typically 4) nearest neighbors for each query image; if it is small, sample images from the cluster corresponding to each query image (see the second sketch after this list)
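
A minimal sketch of the embedding and cosine-similarity step. Using faiss is my assumption (any nearest-neighbor library works), the 0.95 threshold is illustrative, and this simple nearest-neighbor check only stands in for, rather than reproduces, the Pizzi et al. (2022) copy-detection pipeline.

```python
import numpy as np
import faiss

def build_index(embeddings: np.ndarray) -> faiss.Index:
    """embeddings: (N, D) float32, e.g. from a ViT-H/16 pretrained on ImageNet-22k."""
    faiss.normalize_L2(embeddings)                # cosine similarity == inner product on unit vectors
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def near_duplicates(index, embeddings, threshold=0.95):
    """Flag each image whose nearest other image exceeds the (assumed) similarity threshold."""
    sims, ids = index.search(embeddings, 2)       # column 0 is the image itself
    return [(i, int(ids[i, 1]))
            for i in range(len(embeddings))
            if ids[i, 1] != i and sims[i, 1] > threshold]
```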
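And a sketch of the clustering/retrieval step under the same assumptions: spherical k-means over the uncurated embeddings, k-NN retrieval when the query set is large, cluster sampling when it is small. The cluster count, neighbor count, and per-cluster sample size here are toy values, not the paper's settings.

```python
import numpy as np
import faiss

def curate(uncurated: np.ndarray, queries: np.ndarray,
           n_clusters=1000, n_neighbors=4, per_cluster=10, large_query_set=True):
    """Both inputs are (N, D) float32 embeddings; returns indices into `uncurated`."""
    d = uncurated.shape[1]
    faiss.normalize_L2(uncurated)
    faiss.normalize_L2(queries)

    if large_query_set:
        # Large query set: retrieve N (typically 4) nearest uncurated images per query.
        index = faiss.IndexFlatIP(d)
        index.add(uncurated)
        _, ids = index.search(queries, n_neighbors)
        return np.unique(ids.ravel())

    # Small query set: cluster the uncurated pool, then sample from the
    # cluster corresponding to each query image.
    km = faiss.Kmeans(d, n_clusters, niter=20, spherical=True)
    km.train(uncurated)
    _, assign = km.index.search(uncurated, 1)     # cluster id per uncurated image
    _, q_assign = km.index.search(queries, 1)     # cluster id per query image
    rng = np.random.default_rng(0)
    selected = []
    for c in np.unique(q_assign):
        members = np.nonzero(assign.ravel() == c)[0]
        take = min(len(members), per_cluster)
        selected.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(selected)
```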
Summary
- DINOv2, a new series of image encoders pretrained on large curated data with no supervision
- Visual features are compatible with classifiers as simple as linear layers, meaning the underlying information is readily available (see the linear-probe sketch below)
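
To make the linear-layer point concrete, here is a minimal linear-probe sketch on frozen features. The hub model and the 384-dim CLS output match the public dinov2 ViT-S/14; the training loop itself is a generic assumption, not the paper's exact evaluation protocol.

```python
import torch
import torch.nn as nn

# Frozen backbone; hub name and the 384-dim CLS output match the public dinov2 ViT-S/14.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # the features stay frozen

head = nn.Linear(384, 1000)                       # ViT-S/14 embedding dim -> 1000 classes
opt = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, H, W) normalized batch; labels: (B,) class indices."""
    with torch.no_grad():
        feats = backbone(images)                  # (B, 384) CLS embedding
    logits = head(feats)
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```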
Keep Exploring!!!