Text to Vision - Image - Survey - Techniques - Lessons
- multimodal-to-text generation models (e.g. Flamingo)
- image-text matching models (e.g. CLIP)
- text-to-image generation models (e.g. Stable Diffusion).
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
- Difficulty generating images with multiple objects
- Quality improvement of generated images
Concepts
- To generate images with multiple objects, layout information such as bounding boxes or segmentation maps is added to the model.
- Cross-attention maps have been found to play a crucial role in image generation quality
- Techniques like “SynGen” [55] and “Attend-and-Excite” [9] have been introduced to improve attention maps
- Mathematically, this process can be modeled as a Markov process.
- The process of adding noise step by step from X0 to XT is called the “forward process” or “diffusion process”
- Conversely, starting from XT, the process of iteratively removing the noise until a clean image is recovered is called the “reverse process” (see the equations sketched after this list)
- Denoising Diffusion Probabilistic Models (DDPM)
- Basic components of a diffusion model
- Noise prediction module - U-net / pure transformer structure
- Condition encoder - conditions generation on extra input such as text; a T5-series encoder or the CLIP text encoder is used in most current works
- Super resolution module - DALL·E 2 employs two super-resolution models in its pipeline
- Dimension reduction module - Text encoder and image encoder of CLIP are components integrated into the DALL·E 2 model
- Diffusion models can also encounter difficulties in accurately representing positional information
- SceneComposer [75] and SpaText [1] concentrate on leveraging segmentation maps for image synthesis
- Subject Driven Generation
- Concept customization or personalized generation
- Present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept
- The advantage of BLIP-Diffusion lies in its ability to perform “zero-shot” subject-driven generation, as well as “few-shot” generation with minimal fine-tuning
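For quick reference, the forward (diffusion) and reverse processes noted above can be written in the standard DDPM notation; this restates the usual formulation (noise schedule $\beta_t$, $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$) rather than anything specific to this survey:

```latex
% Forward (diffusion) process: add Gaussian noise step by step from x_0 to x_T
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)

% Reverse process: learn to remove the noise, starting from x_T ~ N(0, I)
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```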
QUALITY IMPROVEMENT OF GENERATED IMAGES
- Mixture of Experts (MoE) [60] is a technique that leverages the strengths of different models, and it has been adapted for use in diffusion models to optimize their performance
- Employ Gaussian blur on certain areas of the prediction, selected according to the self-attention map, to extract this condition (rough sketch below)
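A rough idea of that attention-guided blur, as a hedged sketch only; the threshold, blur size, and the way the self-attention map is obtained are assumptions, not the exact recipe from the cited work:

```python
# Hedged sketch: blur only the regions of a predicted image that a
# self-attention map highlights; threshold and kernel size are assumptions.
import torch
import torchvision.transforms.functional as TF

def blur_by_attention(pred, attn_map, threshold=0.5, kernel_size=15):
    """pred: (C, H, W) predicted image; attn_map: (h, w) self-attention map in [0, 1]."""
    # Upsample the attention map to the image resolution.
    attn = torch.nn.functional.interpolate(
        attn_map[None, None], size=pred.shape[-2:], mode="bilinear", align_corners=False
    )[0, 0]
    mask = (attn > threshold).float()              # 1 where attention is high
    blurred = TF.gaussian_blur(pred, kernel_size)  # blur the whole prediction
    # Keep the blurred version only inside the high-attention region.
    return mask * blurred + (1.0 - mask) * pred
```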
Reverse Stable Diffusion: What prompt was used to generate this image?
- new task of predicting the text prompt given an image generated by a generative diffusion model
- DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.
- Diffusion Explorer
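The DiffusionDB prompts and images mentioned above can be pulled from the Hugging Face Hub; a minimal loading sketch, where the dataset id and the small subset name are assumptions about the published dataset:

```python
# Hedged sketch: load a small slice of DiffusionDB (prompt/image pairs),
# e.g. for prompt-prediction experiments; dataset id and subset are assumptions.
from datasets import load_dataset

db = load_dataset("poloclub/diffusiondb", "2m_random_1k", split="train")
sample = db[0]
print(sample["prompt"])   # the text prompt a real user typed
sample["image"].show()    # the Stable Diffusion image it produced
```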
Learning framework for prompt embedding estimation
Reversing the text-to-image diffusion process
- Predict a sentence embedding of the original prompt used to generate the input image
- As underlying models, we consider three state-of-the-art architectures that are agnostic to the generative mechanism of Stable Diffusion, namely ViT, CLIP and Swin Transformer
- U-Net model from Stable Diffusion, which operates in the latent space.
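A minimal sketch of how such a prompt-embedding regressor could be set up; this is an assumption-laden illustration rather than the paper's actual training code, with the image backbone, target sentence-embedding size, and loss all being placeholder choices:

```python
# Hedged sketch: regress a sentence embedding of the prompt from the generated
# image; backbone, embedding size, and loss are placeholder assumptions.
import torch
import torch.nn as nn
import torchvision

class PromptEmbeddingRegressor(nn.Module):
    def __init__(self, embed_dim=384):
        super().__init__()
        backbone = torchvision.models.vit_b_16(weights="DEFAULT")
        backbone.heads = nn.Identity()          # keep the 768-d ViT features
        self.backbone = backbone
        self.head = nn.Linear(768, embed_dim)   # map to the sentence-embedding size

    def forward(self, images):                  # images: (B, 3, 224, 224)
        return self.head(self.backbone(images))

def training_step(model, images, target_prompt_embeddings):
    pred = model(images)
    # Cosine-similarity objective: push predictions toward the true prompt embedding.
    return 1.0 - nn.functional.cosine_similarity(pred, target_prompt_embeddings).mean()
```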
Explain in layman's terms - U-Net model from Stable Diffusion, which operates in the latent space.
The U-Net is a type of artificial intelligence model originally designed for computer vision tasks like image segmentation, where it identifies and separates different objects or features within an image; in Stable Diffusion it is reused as the network that removes noise.
Imagine you have a picture that appears blurry, full of noise, or unclear. The U-Net model from Stable Diffusion operates like a sophisticated visual detective, which can work back through the noise, step by step, to try and reconstruct the original picture.
To do this, it operates in what we call the 'latent space', which is loosely analogous to the mind’s eye of the AI - it's where the AI forms a sort of abstract, compressed understanding of the different elements present in the image, their shapes, and how they relate to each other. You can think of the latent space as a box where the details of the image are stored in a compact form, almost like the raw components before they've been assembled into the complete picture.
So, in Stable Diffusion the picture is first compressed into this latent space by a separate encoder - squeezing and organizing the information into a form the model can handle - the U-Net then works step by step on that compressed representation to strip away the noise, and a decoder finally rebuilds the full, clearer picture from it. It's essentially a way of moving from a jumble of details into a structured "blueprint" in the latent space, and then using that blueprint to rebuild a clear and accurate image.
A key aspect of the U-Net model is its structure, which is shaped like a U (hence the name 'U-Net'). The first half of the U takes the noisy input and condenses it down into a compact summary (this is called encoding or downsampling). The second half then expands this summary back out into a cleaned-up output (known as decoding or upsampling), with skip connections carrying fine details directly from one side of the U to the other. This U-shaped structure, combined with working in the latent space, allows the model to effectively manage and recover the important details from the noisy input and significantly improve the quality of the generated output.
So in simple terms, the U-Net model from Stable Diffusion operates like a skilled restorer, turning a distorted or noisy picture back into a clear and identifiable image by operating in its “mind’s eye” or latent space, using a special U-shaped structure to carefully manage detail extraction and restoration.
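To make the above concrete, here is a minimal sketch of how the pieces fit together using the Hugging Face diffusers library; the model id, step count, and the omission of classifier-free guidance are simplifying assumptions:

```python
# Hedged sketch of the Stable Diffusion pieces with the `diffusers` library;
# model id, 50 steps, and skipping classifier-free guidance are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

# 1. Encode the prompt into conditioning embeddings (the "condition encoder").
tokens = tokenizer("a photo of a corgi", padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
cond = text_encoder(tokens.input_ids)[0]

# 2. Start from pure Gaussian noise in the latent space (4x64x64, not pixels).
latents = torch.randn(1, unet.config.in_channels, 64, 64)

# 3. Reverse process: at every step the U-Net predicts the noise and the
#    scheduler removes a little of it, refining the latent "blueprint".
scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. The VAE decoder turns the clean latent back into a full-resolution image.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```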
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
VLP: A Survey on Vision-Language Pre-training
Image Feature Extraction
- By using Faster R-CNN, VLP models obtain OD (object detection)-based region feature embeddings
- Alternatively, some VLP models use grid features and train the CNN end-to-end
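A rough illustration of the two options above, as a hedged sketch where torchvision models stand in for the detectors/backbones actually used in VLP papers:

```python
# Hedged sketch of region vs. grid image features; model choices and the
# `weights="DEFAULT"` argument assume a recent torchvision, not the VLP papers.
import torch
import torchvision

image = torch.rand(3, 480, 640)  # a dummy RGB image tensor

# (a) Region features: run an object detector and keep the top-K box regions;
#     VLP models then embed each detected region (here we just keep the boxes).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    detections = detector([image])[0]         # dict with "boxes", "labels", "scores"
top_boxes = detections["boxes"][:36]          # e.g. 36 regions per image

# (b) Grid features: take the final convolutional feature map of a plain CNN
#     and flatten its spatial grid into a sequence of patch-like features.
backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pool + fc
with torch.no_grad():
    fmap = backbone(image.unsqueeze(0))           # (1, 2048, H/32, W/32)
grid_features = fmap.flatten(2).transpose(1, 2)   # (1, H/32 * W/32, 2048)
```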
Video Feature Extraction
- VLP models [17, 18] extract the frame features by using the method mentioned above
Text Feature Models
- For textual features, following pre-trained language models such as BERT [2], RoBERTa [24], ALBERT [25], and XLNet [26], VLP models [9, 27, 28] first segment the input sentence into a sequence of subwords
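For instance, the subword segmentation step looks roughly like this with the transformers library; the specific checkpoint is just a common choice, not necessarily the one used in those papers:

```python
# Hedged sketch of the text-side preprocessing used by VLP models: segment the
# sentence into subwords with a pre-trained tokenizer, then embed them.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("A man is skateboarding down a ramp", return_tensors="pt")
# WordPiece subwords; '##' marks pieces that continue the previous word.
print(tokenizer.convert_ids_to_tokens(encoded.input_ids[0]))

text_features = model(**encoded).last_hidden_state  # (1, seq_len, 768)
```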
Ideas summary
- Detection models for Region feature embedding
- Grid based feature extraction with CNN
- Super resolution module to the pipeline
- Subject Driven Generation, Concept customization or personalized generation - Present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept
- Gaussian blur on certain areas based on attention / relevance
- Captioning, Category recognition
- Category Recognition (CR) CR refers to identifying the category and sub-category of a product, such as {HOODIES, SWEATERS}, {TROUSERS, PANTS}
- Multi-modal Sentiment Analysis (MSA) MSA aims to detect sentiments in videos by leveraging multi-modal signals
Text-to-image Diffusion Models in Generative AI: A Survey
The learning goal of a DM is to reverse a process of perturbing the data with noise, i.e. diffusion, for sample generation
Diffusion Probabilistic Models (DPM), Score-based Generative Models (SGM)
Denoising diffusion probabilistic models (DDPMs) are defined as a parameterized Markov chain
- Forward pass. In the forward pass, DDPM is a Markov chain in which Gaussian noise is added to the data at each step until the image is destroyed (reduced to nearly pure noise)
- Reverse pass. With the forward pass defined above, the transition kernels of a learnable reverse process are trained to remove that noise step by step
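A compact sketch of how the two passes translate into one training step (the usual epsilon-prediction DDPM objective); the noise schedule values and the model interface are assumptions:

```python
# Hedged sketch of one DDPM training step (epsilon-prediction objective);
# `model` is any noise-prediction network such as a U-Net (assumed interface).
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def ddpm_training_step(model, x0):
    """x0: a batch of clean images, shape (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # random timestep per sample
    noise = torch.randn_like(x0)                  # epsilon ~ N(0, I)
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    # Forward pass: jump directly to x_t using the closed-form marginal.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Reverse pass is learned by predicting the noise that was added.
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)
```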
Conditional diffusion model: A conditional diffusion model learns from additional information (e.g., class and text) by taking them as model input.
Guided diffusion model: During the training of a guided diffusion model, the class-induced gradients (e.g. through an auxiliary classifier) are involved in the sampling process.
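A hedged sketch of the guided variant at sampling time: the gradient of an auxiliary classifier's log-probability for the target class nudges the noise prediction. The epsilon-space scaling follows the common classifier-guidance formulation, and the classifier, guidance scale, and the way sigma_t is supplied are assumptions:

```python
# Hedged sketch of classifier guidance; classifier, scale, and sigma handling
# are assumptions, not a specific paper's released code.
import torch

def guided_eps(unet, classifier, x_t, t, y, sigma_t, scale=1.0):
    """x_t: noisy batch (B, C, H, W); y: target class ids (B,); sigma_t: noise std at step t."""
    eps = unet(x_t, t)                                   # unconditional noise prediction
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[torch.arange(len(y)), y].sum()  # log p(y | x_t)
    grad = torch.autograd.grad(selected, x_in)[0]        # gradient w.r.t. the noisy input
    # Shift the predicted noise against the classifier gradient
    # (epsilon-space form of classifier guidance).
    return eps - scale * sigma_t * grad
```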
Awesome Video Diffusion
Keep Exploring!!!