[Summary] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

TL;DR Existing diffusion models for sequence generation have two main limitations: they either generate sequences one token at a time, without the ability to steer the sampling process toward desired outcomes, or they diffuse the entire sequence iteratively but are constrained to a fixed sequence length. Diffusion Forcing combines the benefits of both approaches by diffusing the entire sequence with an independent noise level per token, where each token's denoising is conditioned on the previous tokens in the sequence....
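The key ingredient above, independent per-token noise levels, can be sketched in a few lines. This is a toy numpy illustration under my own assumptions (the function name and the linear schedule are hypothetical, not from the paper): each token draws its own noise level, unlike full-sequence diffusion, which shares one level across the whole sequence.

```python
import numpy as np

def noise_sequence(tokens, num_levels=1000, rng=None):
    """Toy sketch: noise a sequence with an independent level per token.

    tokens: (seq_len, dim) clean sequence.
    Returns the noised sequence and the per-token noise levels.
    """
    rng = np.random.default_rng(rng)
    seq_len, dim = tokens.shape
    # Each token draws its own noise level k_t, rather than one shared k.
    k = rng.integers(0, num_levels, size=seq_len)
    alpha = 1.0 - k / num_levels                      # toy linear schedule
    eps = rng.standard_normal((seq_len, dim))
    noised = np.sqrt(alpha)[:, None] * tokens + np.sqrt(1 - alpha)[:, None] * eps
    return noised, k
```

During denoising, the model would then be conditioned on the (partially denoised) previous tokens while each token carries its own level, which is what enables both steerable sampling and variable-length generation.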

July 21, 2024 · 2 min · 387 words

[Summary] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models

TL;DR ControlNet is a framework for controlling the content of images generated by diffusion models. The process involves taking a trained diffusion model, freezing its weights, cloning some of its building blocks, and training the cloned weights with a conditioning input image. Method Architecture. Given a trained diffusion model, the ControlNet model is created by: Freezing the parameters of the original model. Cloning some of the original model blocks into a trainable copy....
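The freeze-and-clone recipe can be sketched as a minimal numpy toy (class and attribute names are my own illustrative assumptions, with a plain matrix product standing in for a real network block): the frozen path keeps the original weights, the cloned path sees the conditioning input, and a zero-initialized projection merges the two so that, at initialization, the combined model behaves exactly like the original.

```python
import numpy as np

class ControlBranch:
    """Toy sketch of the ControlNet recipe on a single linear 'block'."""

    def __init__(self, frozen_w):
        self.frozen_w = frozen_w            # original weights, kept frozen
        self.trainable_w = frozen_w.copy()  # cloned copy, to be trained
        # Zero-initialized projection: the control branch contributes
        # nothing until training updates it.
        self.zero_proj = np.zeros((frozen_w.shape[1], frozen_w.shape[1]))

    def forward(self, x, cond):
        locked = x @ self.frozen_w                   # frozen original path
        controlled = (x + cond) @ self.trainable_w  # clone sees the condition
        return locked + controlled @ self.zero_proj
```

Because the projection starts at zero, `forward(x, cond)` initially equals the frozen model's output, which is the property that lets training start from the pretrained model's behavior without degrading it.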

March 2, 2024 · 2 min · 316 words

[Summary] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

TL;DR The process of video editing can be time-consuming and laborious. Many diffusion-based models for videos either fail to preserve temporal consistency or require significant resources. To address this, the "RAVE" method incorporates a clever trick: it takes video frames and combines them into a "grid image". The grid image is then fed to a diffusion model (+ControlNet) to produce an edited version of the grid. Reconstructing the video from the edited grid image results in a temporally consistent edited video....
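The grid trick itself is simple array plumbing. A minimal numpy sketch (function names and shape conventions are my own assumptions): tile the frames into one large image so a single image-editing pass sees all frames at once, then split the edited grid back into frames.

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile (rows*cols, H, W, C) frames into one (rows*H, cols*W, C) grid."""
    n, h, w, c = frames.shape
    assert n == rows * cols
    grid = frames.reshape(rows, cols, h, w, c)
    # Interleave the row/height and column/width axes to form the image.
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)

def grid_to_frames(grid, rows, cols):
    """Inverse: split an edited grid image back into individual frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c)
    return frames.transpose(0, 2, 1, 3, 4).reshape(rows * cols, h, w, c)
```

The two functions are exact inverses, so any edit applied to the grid image propagates to every frame simultaneously, which is where the temporal consistency comes from.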

January 6, 2024 · 2 min · 422 words