"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your gut and don't follow the herd" ; "Validate direction, not destination" ;

December 25, 2023

Stable Diffusion Internals

Stable Diffusion Key Steps

  • Method of learning to generate new content - forward/reverse diffusion
  • Way to link text and images - a text-image representation model (CLIP)
  • Way to compress images - an autoencoder
  • Way to add good inductive biases - U-Net architecture + attention
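The forward half of the first ingredient has a closed form: any noising step x_t can be sampled directly from x_0. A minimal numpy sketch (toy 8x8 array standing in for an image; the linear beta schedule follows the DDPM paper):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule over 1000 steps
x0 = rng.standard_normal((8, 8))        # toy stand-in for an image
xt, eps = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# At the last step alpha_bar is ~0, so x_t is nearly pure Gaussian noise.
```

The model is then trained to predict `eps` from `xt` and `t`; the reverse process undoes the noising one step at a time.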

Build Stable Diffusion “from Scratch”

  • Principle of diffusion models (sampling, learning)
  • Diffusion for images – U-Net architecture
  • Understanding prompts – words as vectors, CLIP
  • Letting words modulate diffusion – conditional diffusion, cross-attention
  • Diffusion in latent space – AutoencoderKL
  • Training on a massive dataset – LAION-5B

GAN

  • One-shot generation. Fast.
  • Harder to control in a single pass.
  • Adversarial min-max objective; training can suffer mode collapse.

Diffusion

  • Multi-iteration generation. Slow.
  • Easier to control during generation. 
  • Simple objective, no adversary in training. 
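"Multi-iteration generation" means sampling runs the reverse process once per timestep. A toy numpy sketch of DDPM ancestral sampling, with a zero predictor standing in for the trained U-Net (so the output is meaningless, but the loop structure, and why it is slow, is visible):

```python
import numpy as np

def ddpm_step(xt, eps_pred, t, betas, rng):
    """One reverse (denoising) step of DDPM ancestral sampling,
    given eps_pred = the model's noise prediction at step t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean                 # last step: no noise added
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x = rng.standard_normal((8, 8))     # start from pure noise
for t in reversed(range(1000)):     # one network call per step -> slow
    eps_pred = np.zeros_like(x)     # stand-in for a trained U-Net
    x = ddpm_step(x, eps_pred, t, betas, rng)
```

Contrast with a GAN, where the generator maps noise to an image in a single forward pass.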

Key Ingredients of UNet

  • Convolution operation
  • Saves parameters (weight sharing); spatially invariant

Down/Up sampling

  • Multiscale / hierarchy
  • Learns modulation at multiple scales and abstraction levels.

Skip connection 

  • No bottleneck
  • Routes features of the same scale directly.
  • Cf. an autoencoder, which has a bottleneck.
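The skip-connection idea can be shown with toy numpy pooling ops (no learned weights, just the routing): features saved before downsampling are concatenated back at the same scale on the way up, so fine detail bypasses the coarse path.

```python
import numpy as np

def downsample(x):
    """2x average-pool: (C, H, W) -> (C, H/2, W/2)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """2x nearest-neighbour upsample: (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.default_rng(0).standard_normal((4, 16, 16))
skip = x                                  # saved before downsampling
h = downsample(x)                         # coarse path loses detail
h = upsample(h)                           # back to the original scale
out = np.concatenate([h, skip], axis=0)   # skip connection: concat channels
# out has 8 channels at 16x16: half coarse context, half routed detail
```

In a real U-Net the concatenated tensor then goes through learned convolutions; a plain autoencoder has no such bypass, so everything must squeeze through the bottleneck.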

Autoencoder 

  • An autoencoder imposes a bottleneck in the network, forcing a compressed knowledge representation of the original input.
  • With non-linear activation functions and multiple layers, an autoencoder can learn non-linear transformations.
  • An ideal autoencoder learns descriptive attributes of faces (skin color, whether or not the person is wearing glasses, etc.) in order to describe an observation in some compressed representation.
  • For variational autoencoders, the encoder is sometimes referred to as the recognition model, and the decoder as the generative model.
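The bottleneck can be illustrated with a toy linear, tied-weight sketch (hypothetical weights chosen orthonormal for the demo; a real autoencoder learns its weights by backprop and adds non-linear activations, which is what lets it beat PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8                       # input dim, bottleneck dim (k << d)

# Orthonormal encoder matrix: d -> k. Stand-in for learned weights.
W = np.linalg.qr(rng.standard_normal((d, k)))[0]

x = W @ rng.standard_normal(k)     # a sample lying on a k-dim subspace
z = W.T @ x                        # encode: only k numbers survive
x_hat = W @ z                      # decode (tied weights: decoder = W)
# x_hat reconstructs x exactly because x lies in the modeled subspace;
# inputs outside it would lose exactly the part the bottleneck discards.
```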

Applications of Autoencoders

  • Image coloring, feature variation, dimensionality reduction, image denoising, watermark removal

PCA vs Autoencoder

  • PCA attempts to discover a lower-dimensional hyperplane that describes the original data.
  • Autoencoders can learn non-linear manifolds (a manifold, in simple terms, is a continuous, non-intersecting surface).
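The "hyperplane" PCA finds is easy to see in numpy via the SVD on a toy 2-D cloud that is stretched along one axis (here the hyperplane is just a line):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: large variance along axis 0, tiny variance along axis 1.
X = rng.standard_normal((100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.1]])

Xc = X - X.mean(axis=0)                 # center, as PCA requires
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                             # direction of the best-fit line

Z = Xc @ pc1[:, None]                   # project to a 1-D code
X_hat = Z @ pc1[None, :]                # linear reconstruction
err = np.mean((Xc - X_hat) ** 2)        # small: this data is nearly linear
```

If the data instead lay on a curved manifold (say a spiral), no single hyperplane would fit it well, which is where a non-linear autoencoder has the advantage.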

ControlNet 

  • ControlNet is a neural network structure that controls diffusion models by adding extra conditions.
  • It copies the weights of the network's blocks into a "locked" copy and a "trainable" copy.
  • The "trainable" copy learns your condition; the "locked" copy preserves the original model.
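ControlNet joins the trainable copy back to the locked branch through zero-initialized layers ("zero convolutions"), so before any training the extra branch contributes nothing and the original model's behavior is preserved exactly. A toy numpy sketch with matrices standing in for network blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

W_locked = rng.standard_normal((4, 4))   # frozen pretrained block
W_train = W_locked.copy()                # trainable copy, same init
W_zero = np.zeros((4, 4))                # "zero convolution" link

def controlnet_block(x, cond):
    """Locked branch plus a conditioned trainable branch,
    joined through a zero-initialized layer."""
    y = W_locked @ x                     # original model, untouched
    return y + W_zero @ (W_train @ (x + cond))

x = rng.standard_normal(4)
cond = rng.standard_normal(4)            # extra condition (e.g. edge map)
# At initialization the zero layer makes the branch a no-op, so the
# controlled model reproduces the locked model exactly; training then
# grows W_zero away from zero to let the condition modulate the output.
```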

Keep Exploring!!!


