TL;DR

Fine-tuning a diffusion model on a single image so that it can generate new images conditioned on user-provided concepts.

Generated image example

Problem statement

Diffusion models are not able to generate new images of user-provided concepts. Methods that enable this capability (e.g., DreamBooth) require several input images containing the desired concept.

Method

The method consists of two phases.

  1. Freezing the model weights and optimizing the handles (the newly added text tokens) to reconstruct the input image. This is done with a high learning rate; since the model weights are frozen, this does not harm the model's generalization.
  2. Fine-tuning the model weights while continuing to optimize the handles. This is done with a small learning rate, which enables faithful reconstruction of the extracted concepts. A sketch of the schedule follows this list.
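A minimal PyTorch sketch of the two-phase schedule. It assumes a Stable-Diffusion-style `unet` and `text_encoder`; `handle_embeddings` (a trainable tensor for the new tokens), `batches`, `diffusion_loss`, and the learning-rate values are illustrative placeholders, not the paper's exact setup.

```python
import torch

def train_two_phases(unet, text_encoder, handle_embeddings, batches, diffusion_loss):
    # Phase 1: freeze the model weights and optimize only the handle
    # embeddings. The high learning rate cannot harm generalization
    # because the model itself is untouched.
    for p in list(unet.parameters()) + list(text_encoder.parameters()):
        p.requires_grad_(False)
    opt = torch.optim.AdamW([handle_embeddings], lr=5e-3)  # illustrative high LR
    for batch in batches["phase1"]:
        loss = diffusion_loss(unet, text_encoder, handle_embeddings, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: unfreeze the model and fine-tune it together with the
    # handles, now with a small learning rate so the extracted concepts
    # are reconstructed faithfully without destroying the prior.
    for p in unet.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(
        [{"params": unet.parameters()}, {"params": [handle_embeddings]}],
        lr=2e-6,  # illustrative low LR
    )
    for batch in batches["phase2"]:
        loss = diffusion_loss(unet, text_encoder, handle_embeddings, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```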

In addition, the authors incorporate the following training techniques.

Union sampling: The resulting model struggles to generate images that combine several concepts. To address this, they train the model with prompts that reference random subsets of the handles, e.g., “a photo of [v_1] and ... [v_2]”.
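A small sketch of how such union-sampling prompts can be composed. The template string and the uniform choice of subset size are assumptions for illustration; only the “a photo of [v_1] and ... [v_2]” prompt form comes from the paper.

```python
import random

def union_sampling_prompt(handles):
    # Sample a random subset of the concept handles and join them into
    # a single prompt, e.g. "a photo of [v1] and [v2]".
    k = random.randint(1, len(handles))
    subset = random.sample(handles, k)
    return "a photo of " + " and ".join(subset)

# Example: union_sampling_prompt(["[v1]", "[v2]", "[v3]"])
# -> "a photo of [v2] and [v3]" (the subset varies per call)
```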

To ensure each handle is associated with only a single concept, they combine two losses (sketched after this list):

  1. Masked loss: The handles and model weights are optimized with a masked version of the diffusion loss, i.e., the objective penalizes only the pixels covered by the concepts' masks. This ensures each handle is associated with a single mask (Eq. 1 in the paper).
  2. Attention loss: They take the cross-attention maps of the newly added tokens and penalize their MSE deviation from the corresponding input masks.
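A hedged sketch of the two losses, assuming the predicted and target noise live in a common pixel or latent space and that the cross-attention maps and masks have been resized to the same resolution; the shapes and the dict-based interface are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred, noise_target, union_mask):
    # Standard diffusion MSE, penalized only on the pixels covered by
    # the masks of the concepts present in the prompt (cf. Eq. 1).
    return F.mse_loss(noise_pred * union_mask, noise_target * union_mask)

def attention_loss(attn_maps, masks):
    # Push each new token's cross-attention map toward its segmentation
    # mask, so a handle cannot attend to (and absorb) other concepts.
    # attn_maps / masks: dicts keyed by handle token.
    per_handle = [F.mse_loss(attn_maps[h], masks[h]) for h in masks]
    return torch.stack(per_handle).mean()
```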

Image generation process

Limitations

The personalization process is slow: each image requires fine-tuning the diffusion model (from the paper: “our method takes about 4.5 minutes to extract the concepts from a single scene and to fine-tune the entire model”).

The image's style is not disentangled from the objects. From the authors’ example: when an input image taken in daylight was used, all generated images included daylight.

From the paper’s examples, it seems the pose is fixed: the dog in Figure 10 is always in the same pose.

In their evaluation, they used COCO-annotated segmentation masks. It is not clear how much noisy segmentation masks affect performance, which matters for large-scale image generation.

Resource

https://arxiv.org/abs/2305.16311