"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 23, 2023

Text to Vision - Image - Survey - Techniques - Lessons

  • multimodal-to-text generation models (e.g. Flamingo)
  • image-text matching models (e.g. CLIP)
  • text-to-image generation models (e.g. Stable Diffusion).

A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions

  • Difficulty generating images with multiple objects
  • Quality improvement of generated images

Concepts

  • To generate images with multiple objects, layout information such as bounding boxes or segmentation maps is added to the model. 
  • Cross-attention maps have been found to play a crucial role in image generation quality (see the sketch after this list)
  • Techniques like “SynGen” [55] and “Attend-and-Excite” [9] have been introduced to improve attention maps
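To make the role of cross-attention concrete, here is a minimal PyTorch-style sketch (my own illustration, not from the survey; shapes and layer names are assumptions) of how image-side latents attend to text-token embeddings. The per-token attention maps produced here are exactly what methods like Attend-and-Excite inspect and reshape.

import torch
import torch.nn.functional as F

# Illustrative shapes: a 64x64 latent grid flattened to 4096 queries,
# 77 text tokens (CLIP-style) as keys/values, hidden size 320.
latents = torch.randn(1, 4096, 320)   # image-side features (queries)
text_emb = torch.randn(1, 77, 768)    # text-encoder output (keys/values)

W_q = torch.nn.Linear(320, 320, bias=False)
W_k = torch.nn.Linear(768, 320, bias=False)
W_v = torch.nn.Linear(768, 320, bias=False)

q, k, v = W_q(latents), W_k(text_emb), W_v(text_emb)
scale = q.shape[-1] ** -0.5

# attn[b, i, j] = how strongly latent position i attends to text token j.
# Reshaping attn[..., j] to 64x64 gives the spatial map for token j, which
# Attend-and-Excite-style methods strengthen for neglected objects.
attn = F.softmax(q @ k.transpose(1, 2) * scale, dim=-1)   # (1, 4096, 77)
out = attn @ v                                            # (1, 4096, 320)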

  • Mathematically, this process can be modeled as a Markov process.
  • The process of adding noise step by step from X0 to XT is called “forward process” or “diffusion process”
  • Conversely, starting from XT, the process of iteratively removing the noise until a clear image is recovered is called the "reverse process" (see the sketch after this list)
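As a concrete illustration (a minimal sketch, not the survey's code), the forward process has a closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative product of (1 - beta_t). The few lines below compute exactly that.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in a single shot using the closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)    # stand-in for a clean image
x_500 = q_sample(x0, t=500)       # noisier version halfway through the chain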

  • Denoising Diffusion Probabilistic Models (DDPM)
  • Basic components of a diffusion model
  • Noise prediction module - U-net / pure transformer structure
  • Condition encoder - the model is conditioned on additional input such as text; a T5-series encoder or the CLIP text encoder is used in most current works (see the sketch after this list)
  • Super resolution module - DALL·E 2 employs two super-resolution models in its pipeline
  • Dimension reduction module - Text encoder and image encoder of CLIP are components integrated into the DALL·E 2 model
  • Diffusion models can also encounter difficulties in accurately representing positional information
  • SceneComposer [75] and SpaText [1] concentrate on leveraging segmentation maps for image synthesis
  • Subject Driven Generation, also called concept customization or personalized generation: present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept
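To make the condition-encoder component concrete, here is a minimal sketch (my own illustration, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint used as the text encoder in Stable Diffusion v1.x) of turning a prompt into the token embeddings the noise-prediction U-Net cross-attends to.

from transformers import CLIPTokenizer, CLIPTextModel

checkpoint = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
text_encoder = CLIPTextModel.from_pretrained(checkpoint)

tokens = tokenizer(
    ["a corgi wearing a red hat"],
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
# (1, 77, 768) sequence of token embeddings: this is the "condition"
# injected into the U-Net via cross-attention at every denoising step.
text_embeddings = text_encoder(**tokens).last_hidden_state
print(text_embeddings.shape)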




  • The advantage of BLIP-Diffusion lies in its ability to perform "zero-shot" generation, as well as "few-shot" generation with minimal fine-tuning

Quality Improvement of Generated Images

  • Mixture of Experts (MoE) [60] is a technique that leverages the strengths of different models, and it has been adapted for use in diffusion models to optimize their performance
  • Employ Gaussian blur on certain areas of the prediction, selected according to the self-attention map, to extract this condition (see the sketch after this list)
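A rough sketch of that idea (my own illustration, not the paper's code; it uses torchvision's GaussianBlur and an assumed threshold): blur only the pixels whose self-attention response is high, and keep the rest of the prediction sharp.

import torch
from torchvision.transforms import GaussianBlur

# Illustrative tensors: a model prediction and an aggregated self-attention map over it.
pred = torch.rand(1, 3, 64, 64)        # predicted image
attn_map = torch.rand(1, 1, 64, 64)    # self-attention response, scaled to [0, 1]

blur = GaussianBlur(kernel_size=9, sigma=3.0)
blurred = blur(pred)

# Blur only the highly-attended region (threshold is an assumption), keep the rest untouched.
mask = (attn_map > 0.7).float()
condition = mask * blurred + (1 - mask) * pred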

Reverse Stable Diffusion: What prompt was used to generate this image?

  • A new task: predicting the text prompt given an image generated by a generative diffusion model
  • DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.
  • Diffusion Explorer

Learning framework for prompt embedding estimation

Reversing the text-to-image diffusion process

  • Predict a sentence embedding of the original prompt used to generate the input image (a minimal training sketch follows this list)
  • As underlying models, the paper considers three state-of-the-art architectures that are agnostic to the generative mechanism of Stable Diffusion, namely ViT, CLIP and Swin Transformer
  • It also uses the U-Net model from Stable Diffusion, which operates in the latent space.
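A minimal sketch of what such a prompt-embedding estimator could look like (my own illustration, not the paper's code; image_encoder stands in for a ViT/CLIP/Swin backbone and target_emb for the sentence embedding of the true prompt): regress an embedding from the image and train against a cosine-similarity loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptEmbeddingEstimator(nn.Module):
    """Map image features to a predicted prompt (sentence) embedding."""
    def __init__(self, image_encoder, feat_dim=768, emb_dim=768):
        super().__init__()
        self.image_encoder = image_encoder   # assumed backbone returning (B, feat_dim)
        self.head = nn.Linear(feat_dim, emb_dim)

    def forward(self, images):
        return self.head(self.image_encoder(images))

def cosine_loss(pred_emb, target_emb):
    # Maximize cosine similarity between predicted and true prompt embeddings.
    return 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=-1).mean()

# Dummy usage with a stand-in encoder so the snippet runs end to end.
dummy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
model = PromptEmbeddingEstimator(dummy_encoder)
images = torch.randn(4, 3, 224, 224)
target_emb = torch.randn(4, 768)              # sentence embeddings of the real prompts
loss = cosine_loss(model(images), target_emb)
loss.backward()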

Explain in layman's terms - U-Net model from Stable Diffusion, which operates in the latent space.

The U-Net is a type of neural network originally designed for computer vision tasks like image segmentation, where it identifies and separates different objects or features within an image; in Stable Diffusion, the same architecture is repurposed to predict and remove noise.

Imagine you have a picture that appears blurry, full of noise, or unclear. The U-Net model from Stable Diffusion operates like a sophisticated visual detective, which can work back through the noise, step by step, to try and reconstruct the original picture.

To do this, it operates in what we call the 'latent space', which is loosely analogous to the mind’s eye of the AI - it's where the AI forms a sort of abstract, compressed understanding of the different elements present in the image, their shapes, and how they relate to each other. You can think of the latent space as a box where the details of the image are stored in a compact form, almost like the raw components before they've been assembled into the complete picture.

So, the U-Net model from Stable Diffusion first takes a noisy image, maps or translates it into this intermediate latent space - compressing and organizing the information in a way it can handle - before then reconstructing the original, clearer image from that. It's essentially a way of moving from a jumble of details into a structured "blueprint" in the latent space, and then using that blueprint to rebuild a clear and accurate image.

A key aspect of the U-Net model is its structure, which is like a U-shape (thus the name 'U-net'). The first half of the U shape takes the noisy image and condenses it down into the blueprint in the latent space (this is called encoding or downsampling). The second half then expands this blueprint back out into the clear image (known as decoding or upsampling). This U-shape structure, combined with the operation in the latent space, allows the model to effectively manage and recover the important details from the noisy input and improve the generated output's quality significantly.
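To make the U-shape tangible, here is a toy sketch (my own illustration, far smaller than the real Stable Diffusion U-Net): one downsampling step, one upsampling step, and a skip connection that carries fine details across the "U".

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-shaped network: encode (downsample), decode (upsample), with one skip connection."""
    def __init__(self, ch=4):
        super().__init__()
        self.enc = nn.Conv2d(ch, 32, 3, padding=1)                    # first half of the "U": condense
        self.down = nn.Conv2d(32, 64, 3, stride=2, padding=1)         # down to the bottom of the "U"
        self.up = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)  # second half: expand back
        self.dec = nn.Conv2d(64, ch, 3, padding=1)                    # 64 = 32 (upsampled) + 32 (skip)
        self.act = nn.SiLU()

    def forward(self, x):
        h1 = self.act(self.enc(x))        # high-resolution features
        h2 = self.act(self.down(h1))      # compressed "blueprint"
        u = self.act(self.up(h2))         # expand back to the input resolution
        u = torch.cat([u, h1], dim=1)     # skip connection: reuse the fine details
        return self.dec(u)                # output with the same shape as the input latent

# Stable Diffusion latents are 4-channel tensors at 1/8 of the image resolution
# (e.g. 64x64 for a 512x512 image); the real U-Net also takes the timestep and text condition.
latent = torch.randn(1, 4, 64, 64)
print(TinyUNet()(latent).shape)           # torch.Size([1, 4, 64, 64])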

So in simple terms, the U-Net model from Stable Diffusion operates like a skilled restorer, turning a distorted or noisy picture back into a clear and identifiable image by operating in its “mind’s eye” or latent space, using a special U-shaped structure to carefully manage detail extraction and restoration.


Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion


VLP: A Survey on Vision-Language Pre-training

Image Feature Extraction

  • By using Faster R-CNN, VLP models obtain OD-based (object-detector-based) region feature embeddings
  • Other VLP models train CNNs end-to-end and use grid features directly (see the sketch after this list)
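For the grid-feature route, a minimal sketch (my own illustration, assuming torchvision's ResNet-50; pretrained weights would be loaded in practice) of pulling a spatial grid of CNN features instead of detector-based region features:

import torch
import torch.nn as nn
from torchvision.models import resnet50

# Keep everything up to the last conv stage; drop global pooling and the classifier head.
backbone = resnet50()                                   # load pretrained weights in practice
grid_extractor = nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(2, 3, 224, 224)                    # a batch of images
grid_feats = grid_extractor(images)                     # (2, 2048, 7, 7) grid of features
# Flatten the 7x7 grid into 49 "visual tokens" for a vision-language transformer.
visual_tokens = grid_feats.flatten(2).transpose(1, 2)   # (2, 49, 2048)
print(visual_tokens.shape)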

Video Feature Extraction

  • VLP models [17, 18] extract the frame features by using the method mentioned above

Text Feature Extraction

  • For the textual features, following pretrained language models such as BERT [2], RoBERTa [24], ALBERT [25], and XLNet [26], VLP models [9, 27, 28] first segment the input sentence into a sequence of subwords (as sketched below)
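A quick sketch of that subword segmentation step (my own illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "A corgi wearing sunglasses skateboarding"
# Words outside the vocabulary are split into '##'-prefixed subword pieces.
print(tokenizer.tokenize(sentence))
# Convert to ids and add the [CLS]/[SEP] special tokens expected by the text encoder.
encoded = tokenizer(sentence, return_tensors="pt")
print(encoded["input_ids"].shape)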



Ideas summary

  • Detection models for Region feature embedding
  • Grid based feature extraction with CNN
  • Super resolution module to the pipeline
  • Subject Driven Generation, Concept customization or personalized generation - Present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept
  • Gaussian blur on certain areas based on attention / relevance
  • Captioning, Category recognition
  • Category Recognition (CR) - identifying the category and sub-category of a product, such as {HOODIES, SWEATERS}, {TROUSERS, PANTS}
  • Multi-modal Sentiment Analysis (MSA) - detecting sentiments in videos by leveraging multi-modal signals

Text-to-image Diffusion Models in Generative AI: A Survey

The learning goal of a DM is to reverse a process of perturbing the data with noise, i.e. diffusion, for sample generation

Diffusion Probabilistic Models (DPM), Score-based Generative Models (SGM)

Denoising diffusion probabilistic models (DDPMs) are defined as a parameterized Markov chain

  • Forward pass. In the forward pass, DDPM is a Markov chain where Gaussian noise is added to the data at each step until the images are destroyed (reduced to nearly pure noise)
  • Reverse pass. With the forward pass defined above, we can train the transition kernels with a reverse process (a minimal training-step sketch follows this list)
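As a concrete illustration of that training recipe (a minimal sketch, not the survey's code; model stands in for any noise-prediction network taking x_t and t), the standard DDPM objective reduces to predicting the added noise with an MSE loss:

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(model, x0):
    """One DDPM training step: corrupt x0 into x_t, then learn to predict the added noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                           # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise    # forward pass in closed form
    noise_pred = model(x_t, t)                              # network that defines the reverse process
    return F.mse_loss(noise_pred, noise)                    # the "simple" DDPM loss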

Conditional diffusion model: A conditional diffusion model learns from additional information (e.g., class and text) by taking them as model input.

Guided diffusion model: During the training of a guided diffusion model, the class-induced gradients (e.g. through an auxiliary classifier) are involved in the sampling process.
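To make the distinction concrete: a conditional model simply takes the condition as an extra input, eps_theta(x_t, t, c), while a guided model perturbs the denoising step with classifier gradients. Below is a minimal sketch of the classifier-guidance update (my own illustration; classifier is an assumed auxiliary classifier trained on noisy images).

import torch

def classifier_guided_noise(eps_pred, x_t, t, classifier, y, guidance_scale, alpha_bar_t):
    """Shift the predicted noise using the gradient of log p(y | x_t) from an auxiliary classifier."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
    selected = log_probs[range(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]        # gradient of log p(y | x_t) w.r.t. x_t
    # Classifier guidance: nudge the noise estimate so sampling moves toward class y.
    return eps_pred - guidance_scale * (1 - alpha_bar_t) ** 0.5 * grad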


Awesome Video Diffusion

Keep Exploring!!!
