From CLIP to Diffusion Models: A Technical Overview

Reference: https://www.youtube.com/watch?v=iv-5mZ_9CPY

CLIP

In February 2021, OpenAI released a new model architecture called CLIP, trained on a dataset consisting of 400 million image-caption pairs.

CLIP consists of two models: a Text Encoder that processes text and an Image Encoder that processes images. Both output 512-dimensional vectors (embeddings), and for a matching image-caption pair the two outputs are trained to be as close to each other as possible.

During training, to make use of every possible pairing within a batch, the model computes the pairwise similarities between all image and caption embeddings.

The entries on the diagonal of this similarity matrix correspond to matching image-caption pairs, and their similarity should be maximized.

All off-diagonal pairs are mismatched, and their similarity should be minimized.

The “C” in CLIP stands for Contrastive, because the model learns to contrast matching and non-matching image-caption pairs.
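As a rough sketch of what this contrastive objective might look like in code (a minimal PyTorch illustration of the idea, not OpenAI's actual training code; the batch size, embedding dimension, and temperature value here are placeholders):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.T / temperature

    # The diagonal holds the matching pairs, so the "correct class"
    # for row i (and for column i) is index i.
    targets = torch.arange(len(logits))

    # Push diagonal similarities up and off-diagonal similarities down,
    # symmetrically over images and captions.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.T, targets)
    return (loss_images + loss_texts) / 2

# Toy usage: random 512-dimensional embeddings for a batch of 8 pairs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```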

CLIP can only convert text and images into vector space; it cannot generate images from vectors in reverse.

DDPM → DDIM

A few weeks after the GPT-3 paper was published, a team from UC Berkeley published the Denoising Diffusion Probabilistic Models (DDPM) paper, which first demonstrated that diffusion processes could generate very high-quality images.

The core idea of diffusion models is to gradually add noise to training images until they become pure noise, then train a neural network to compute the reverse of this process.

Although noise is added and removed step by step, the model’s goal at each step is always to return to the original image.

During the image generation phase, random noise must also be added at each step. The DDPM paper borrowed some fairly complex theory to derive these algorithms.

Below, we introduce a mathematically equivalent way of looking at diffusion models, which gives an intuitive sense of why the DDPM algorithm works.

Suppose the training images form a spiral distribution in vector space.

Adding random noise to an image amounts to randomly perturbing each pixel value. In vector space, the corresponding point performs a random walk, analogous to Brownian motion in physics, which is where the name “diffusion models” comes from.

Each point in vector space represents an image.

Our model’s training data consists of many random walks starting from different points in the dataset. What the model needs to do is reverse time: given the position at time t, compute the position at time t-1.
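
To make this picture concrete, here is a small toy sketch (my own illustration of the random-walk view, not code from the paper or the video): points sampled from a 2D spiral are pushed through a Gaussian random walk until the structure disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": points on a 2D spiral, each point standing in for an image.
theta = rng.uniform(0.0, 4.0 * np.pi, size=1000)
x0 = np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1) / (4.0 * np.pi)

# Forward diffusion as a random walk: at every step, nudge each point by a
# small Gaussian perturbation. After enough steps the spiral is unrecognizable.
num_steps = 50
step_std = 0.1
trajectory = [x0]
for _ in range(num_steps):
    trajectory.append(trajectory[-1] + step_std * rng.normal(size=x0.shape))

x_T = trajectory[-1]  # roughly structureless noise; the reverse model must undo this
```

Training data for the reverse model would then be pairs of noisy positions and time steps, together with the noise that produced them.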

In the actual paper, the model is trained to predict the total noise added throughout the entire walk process.

Mathematically, it can be shown that the expected noise of the final step, given where the walk started and ended, equals the total noise divided by the number of steps.

This greatly reduces the variance of the training process, since every target effectively points back to the original starting image, allowing the model to learn more effectively.
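
A quick numerical check of this claim, under the simple random-walk framing above (not the paper's exact parameterization): across many simulated walks, the best linear predictor of the final step given the total noise has slope 1 / (number of steps), i.e. the expected final step is the total noise divided by the number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)
num_steps, num_walks = 50, 200_000

# Simulate many 1D random walks; record each walk's final step and total noise.
steps = rng.normal(size=(num_walks, num_steps))
total_noise = steps.sum(axis=1)
last_step = steps[:, -1]

# Best linear predictor of the last step from the total noise:
# slope = Cov(last, total) / Var(total), which comes out close to 1 / num_steps.
slope = np.cov(last_step, total_noise)[0, 1] / total_noise.var()
print(slope, 1.0 / num_steps)  # both approximately 0.02
```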

We write the coordinates as a function f of a time variable t, corresponding to the number of random-walk steps taken.

==As t decreases from 1, there comes a point where the reverse vector field abruptly switches from pointing toward the dataset’s center (its mean) to pointing toward the dataset itself==

Returning to the DDPM paper: each reverse-diffusion step adds random noise on top of the model’s output position, with the step size shrinking as the number of steps grows. Repeating this process lets the samples converge onto the dataset itself.

If the random noise is removed entirely, all points quickly collapse toward the spiral’s center and then onto a single inner edge, losing the diversity of the dataset. In image terms, generation never fully reaches the manifold of real images, and the results are blurry and unrealistic.

DDPM made diffusion models a viable method for image generation, but diffusion models weren’t immediately widely adopted.

The key problem with DDPM was that every step required a full forward pass through a large neural network, and the original method needed on the order of a thousand steps.

Later, two papers published by Google and Stanford showed that the same results could be achieved without adding random noise.

DDIM removes the added noise and takes deterministic steps, which lets the trajectories converge onto the spiral distribution with far fewer steps.
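
Switching back to the papers' own notation rather than the toy random-walk picture, the two sampling rules can be sketched roughly as follows (`eps_model` is a placeholder for the trained noise-prediction network, and the beta schedule uses typical values; this is an illustrative sketch, not the papers' reference code):

```python
import numpy as np

rng = np.random.default_rng(2)

# A standard linear beta schedule (typical values, not tuned).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x, t):
    """Placeholder for the trained noise-prediction network."""
    return np.zeros_like(x)

def ddpm_step(x, t):
    # Stochastic reverse step in the style of DDPM: denoise a little,
    # then add fresh random noise (except at the very last step).
    eps = eps_model(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    z = rng.normal(size=x.shape) if t > 0 else np.zeros_like(x)
    return mean + np.sqrt(betas[t]) * z

def ddim_step(x, t, t_prev):
    # Deterministic reverse step in the style of DDIM (eta = 0): no fresh
    # noise is added, which also makes it possible to jump from t straight
    # to a much earlier t_prev and use far fewer steps overall.
    eps = eps_model(x, t)
    x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps
```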

Improving Performance

In 2022, OpenAI used image-caption pairs to train a diffusion model to invert the CLIP image encoder. They called this approach unCLIP; commercially it is known as DALL-E 2.

The model can perform denoising based on text information, a technique called “Conditioning.”

There are multiple ways to input text vectors into diffusion models.
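
One of the simplest options is to concatenate the text embedding to the input of the noise-prediction network; a minimal sketch (the names and dimensions here are illustrative, not any particular model's architecture) might look like this:

```python
import torch
import torch.nn as nn

class ConditionedNoisePredictor(nn.Module):
    """Toy noise-prediction MLP that receives a text embedding as extra input."""

    def __init__(self, image_dim=2, text_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim + 1, hidden_dim),  # +1 for the timestep
            nn.ReLU(),
            nn.Linear(hidden_dim, image_dim),
        )

    def forward(self, x_t, t, text_emb):
        # Concatenate the noisy sample, the timestep, and the text embedding,
        # then predict the noise to remove.
        return self.net(torch.cat([x_t, t, text_emb], dim=-1))

model = ConditionedNoisePredictor()
noise_hat = model(torch.randn(4, 2), torch.rand(4, 1), torch.randn(4, 512))
```

Larger models typically inject the text through cross-attention layers instead, but the principle is the same: the text vector becomes an extra input that the denoiser can use.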

However, images generated using text conditioning alone often fail to capture all of the information in the prompt.

We need a stronger method to guide the model.

Suppose different parts of the spiral represent different categories of images. When we add the corresponding category to the diffusion model’s training input, we find at generation time that samples do not fit the spiral very tightly and that there is some confusion between categories.

This is because the model is simultaneously learning to fit the spiral and to fit the corresponding categories, and the pull toward matching the spiral outweighs the pull toward any specific category’s direction.

One method for improving consistency between text and generated images is called ==Classifier-Free Guidance==. This method shows remarkable effectiveness and is now an important part of current image and video generation models. It works as follows:

==Final Prediction = Unconditional Prediction + Guidance Strength × (Conditional Prediction - Unconditional Prediction)==
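
Read as code, that formula might look like the sketch below (`eps_model` is a generic noise-prediction network; in practice a single network is usually trained with the prompt randomly dropped so that it can produce both the conditional and the unconditional prediction, and the guidance scale of 7.5 is just a commonly used value):

```python
def classifier_free_guidance(eps_model, x_t, t, text_emb, empty_emb, guidance_scale=7.5):
    # Run the same network twice: once with the real prompt embedding and once
    # with an "empty" prompt embedding standing in for the unconditional case.
    eps_cond = eps_model(x_t, t, text_emb)
    eps_uncond = eps_model(x_t, t, empty_emb)
    # Final prediction = unconditional + guidance strength * (conditional - unconditional).
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```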

There’s also the negative prompting approach. In addition to providing content you want in the image (positive prompts), you can explicitly tell the model what content you don’t want to appear. This works similarly to Classifier-Free Guidance:
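
A hedged sketch of how negative prompting is commonly implemented: the prediction conditioned on the negative prompt takes the place of the unconditional prediction, so the result is pushed away from the unwanted content.

```python
def negative_prompt_guidance(eps_model, x_t, t, positive_emb, negative_emb, guidance_scale=7.5):
    eps_positive = eps_model(x_t, t, positive_emb)
    eps_negative = eps_model(x_t, t, negative_emb)
    # Same shape as classifier-free guidance, but the negative-prompt prediction
    # replaces the unconditional one, steering the result away from that content.
    return eps_negative + guidance_scale * (eps_positive - eps_negative)
```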

Copyright Notice

Author: Aspi-Rin

Link: https://blog.aspi-rin.top/en/posts/from-clip-to-diffusion-models-a-technical-overview/

License: CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License. Please attribute the source.
