TL;DR

ControlNet is a framework for controlling the content of images generated by diffusion models. The process involves taking a pretrained diffusion model, freezing its weights, cloning some of its building blocks, and training the cloned weights with a conditioning input image.

(Figure: examples)

Method

Architecture. Given a pretrained diffusion model, the ControlNet model is created by:

  1. Freezing the parameters of the original model.
  2. Cloning some of the original model blocks into a trainable copy. The trainable copy takes an external conditioning vector c as input.
  3. Connecting the trainable copy to the locked model with zero-initialized convolution layers ("zero convolutions"), so that at the start of training the trainable branch contributes nothing and the original model's behavior is preserved (a minimal sketch of this wiring follows the architecture figure).

(Figure: ControlNet architecture)
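
To make step 3 concrete, here is a minimal PyTorch sketch of how a frozen block, its trainable copy, and zero-initialized 1×1 convolutions could be wired together. The `ControlledBlock` wrapper, the channel handling, and the toy block in the usage example are illustrative assumptions, not the actual Stable Diffusion module layout.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and bias start at zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """A frozen block and its trainable copy, joined by zero convolutions (illustrative)."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.trainable = copy.deepcopy(block)   # trainable clone of the block
        self.locked = block                     # original block, weights frozen
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)      # injects the condition into the copy
        self.zero_out = zero_conv(channels)     # feeds the copy's output back

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        locked_out = self.locked(x)
        ctrl_out = self.trainable(x + self.zero_in(cond))
        # Both zero convs output zeros at initialization, so the controlled
        # block initially behaves exactly like the original locked block.
        return locked_out + self.zero_out(ctrl_out)


# Tiny usage example with a toy block (shapes are illustrative).
block = ControlledBlock(nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.SiLU()), channels=64)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```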

Training process. ControlNet must be trained for every new type of conditioning signal: Given an input image $z_0$, image diffusion algorithms progressively add noise to it and produce a noisy image $z_t$, where $t$ is the number of times noise has been added. Given a set of conditions including the time step $t$, a text prompt $c_t$, and a task-specific condition $c_f$, image diffusion algorithms learn a network $\epsilon_\theta$ to predict the noise added to the noisy image $z_t$ with the training loss

$$\mathcal{L} = \mathbb{E}_{z_0, t, c_t, c_f, \epsilon \sim \mathcal{N}(0, 1)}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \right\rVert_2^2\right]$$
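
A rough PyTorch sketch of one training step under this objective is below. The noise-schedule values and the `eps_theta(zt, t, c_t, c_f)` call signature are illustrative stand-ins for the real pipeline, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative DDPM-style noise schedule (not Stable Diffusion's exact values).
NUM_STEPS = 1000
betas = torch.linspace(1e-4, 2e-2, NUM_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def add_noise(z0: torch.Tensor, eps: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Forward diffusion: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod.to(z0.device)[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps


def training_step(eps_theta, z0, c_t, c_f):
    """One training step: regress the network's noise prediction onto the true noise.

    `eps_theta` stands in for the ControlNet-augmented denoising network,
    `c_t` for the text condition and `c_f` for the task-specific condition.
    """
    t = torch.randint(0, NUM_STEPS, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    zt = add_noise(z0, eps, t)
    eps_pred = eps_theta(zt, t, c_t, c_f)
    return F.mse_loss(eps_pred, eps)   # || eps - eps_theta(z_t, t, c_t, c_f) ||_2^2
```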

Some small details:

  • During training, the text prompt is sometimes replaced with an empty string. This encourages the model to recognize semantics (e.g., edges, poses, depth) directly from the input conditioning image as a replacement for the prompt.
  • Since the original parameters are frozen, no gradients need to be computed for them, which saves GPU memory.
  • Since Stable Diffusion is a latent diffusion model, each input conditioning image must first be converted from its 512 × 512 input size into a 64 × 64 feature-space vector that matches the size of Stable Diffusion's latents. This is done by a small encoder network of four convolution layers (see the sketch after this list).
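
Below is a hedged sketch of such a conditioning encoder: four convolution layers, three of which downsample by 2 so that a 512 × 512 image becomes a 64 × 64 feature map. The channel widths and kernel sizes are assumptions for illustration, not the exact values used with Stable Diffusion.

```python
import torch
import torch.nn as nn


class ConditionEncoder(nn.Module):
    """Tiny encoder mapping a 512x512 conditioning image to a 64x64 feature map.

    Four convolution layers; three of them downsample by 2 (512 -> 64 overall).
    Channel widths and kernel sizes here are illustrative.
    """

    def __init__(self, in_channels: int = 3, out_channels: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2, padding=1),   # 512 -> 256
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),            # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),            # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=1, padding=1),  # keep 64x64
        )

    def forward(self, cond_image: torch.Tensor) -> torch.Tensor:
        return self.net(cond_image)


# Example: a batch of one 512x512 RGB conditioning image (e.g., an edge map).
cf = ConditionEncoder()(torch.randn(1, 3, 512, 512))
print(cf.shape)  # torch.Size([1, 128, 64, 64])
```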

Limitations

  • ControlNet roughly doubles the size of the pretrained diffusion model, which requires more GPU memory.
  • The copied model must be trained separately for each new type of conditioning input.

Resources

arXiv

GitHub