TL;DR
To enable more controllable image generation with diffusion models, MultiDiffusion introduces patch-based generation under a global constraint.
Problem statement
Diffusion models lack user controllability, and methods that offer such control require costly fine-tuning.
Method
The method reduces to the following algorithm. At each time step t:
- Extract patches from the global image I_{t-1}
- Run the de-noising step on each patch to generate the denoised patches J_{i,t}
- Combine the patches by averaging their pixel values to produce the global image I_t (sketched below)
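A minimal sketch of one such fusion step, assuming a PyTorch-style placeholder `denoise_step(patch, t)` that stands in for one reverse step of a pretrained diffusion model (the function name, patch size, and stride are illustrative assumptions, not the paper's API):

```python
import torch

def multidiffusion_step(image, t, denoise_step, patch_size=64, stride=32):
    """One fusion step: de-noise overlapping patches of the global image
    and average the results back into a single image.

    `denoise_step(patch, t)` is a hypothetical callable returning the
    de-noised patch; it stands in for one reverse-diffusion step.
    """
    _, _, h, w = image.shape
    # Assume the patch grid tiles the image exactly.
    assert (h - patch_size) % stride == 0 and (w - patch_size) % stride == 0

    fused = torch.zeros_like(image)   # accumulates de-noised patch values
    counts = torch.zeros_like(image)  # how many patches cover each pixel

    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patch = image[:, :, top:top + patch_size, left:left + patch_size]
            denoised = denoise_step(patch, t)  # J_{i,t} in the notes
            fused[:, :, top:top + patch_size, left:left + patch_size] += denoised
            counts[:, :, top:top + patch_size, left:left + patch_size] += 1

    return fused / counts  # per-pixel average over overlapping patches
```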
For the panorama use case: simply generate N images with overlapping regions between them, and at each de-noising step average the pixel values in the overlapping regions.
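For the panorama case, a rough end-to-end sketch (again with the placeholder `denoise_step`; the canvas size, window size, and step count are made-up values, not from the paper) slides the model's native window across a wide canvas and applies the same averaging at every step:

```python
import torch

def generate_panorama(denoise_step, height=64, width=512, window=64,
                      stride=32, num_steps=50, channels=4):
    """Start from noise over a wide canvas and, at every de-noising step,
    average the model's predictions over overlapping horizontal windows.
    `denoise_step(x, t)` is a hypothetical stand-in for one reverse step
    of a pretrained, fixed-input-size diffusion model."""
    canvas = torch.randn(1, channels, height, width)
    offsets = list(range(0, width - window + 1, stride))

    for t in reversed(range(num_steps)):
        fused = torch.zeros_like(canvas)
        counts = torch.zeros_like(canvas)
        for left in offsets:
            crop = canvas[:, :, :, left:left + window]
            fused[:, :, :, left:left + window] += denoise_step(crop, t)
            counts[:, :, :, left:left + window] += 1
        canvas = fused / counts.clamp(min=1)  # average the overlap regions
    return canvas
```

The per-step averaging is what keeps the overlapping windows consistent with each other, so the final panorama has no visible seams between crops.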
Limitations
The compute cost is linear in the number of patches (each additional patch requires an extra diffusion-model forward pass).