[Summary] Object Recognition as Next Token Prediction

TL;DR Models for object classification require a fixed set of pre-defined classes, which constrains them from recognizing arbitrary objects. In this paper, a visual classifier is trained to predict the most likely token of a pre-trained Large Language Model (LLM). Since LLMs are trained on extensive textual data, training a model to predict across the entire token space allows it to capture the full range of textual information. Methods The model is trained to predict a probability for each token of a pretrained LLM: denoting Xv as the visual features, W as the LLM token embeddings, and w as the most probable single token, the model prediction is w = argmax softmax(W Xv). To guide the language decoder, the authors prompt it with “the objects in the image are” (Xp)....
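The scoring step above can be sketched in a few lines. This is a minimal toy, not the paper's actual model: the function name, vocabulary size, and feature dimension are all hypothetical, and the "visual features" are just a random vector standing in for an image encoder's output.

```python
import numpy as np

def predict_token_probs(x_v, W):
    """Score visual features against every LLM token embedding.

    x_v : (d,)   visual feature vector (toy stand-in for an image encoder)
    W   : (V, d) pretrained LLM token-embedding matrix
    Returns a probability distribution over the full vocabulary of size V.
    """
    logits = W @ x_v                 # one logit per token in the vocabulary
    logits -= logits.max()           # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

rng = np.random.default_rng(0)
W = rng.normal(size=(32000, 16))     # toy 32k-token vocabulary
x_v = rng.normal(size=16)
probs = predict_token_probs(x_v, W)
w = int(probs.argmax())              # most probable single token
```

Because the output space is the LLM's entire vocabulary rather than a fixed label set, any concept the LLM can name is a candidate prediction.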

April 23, 2024 · 2 min · 267 words

[Summary] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models

TL;DR ControlNet is a framework for controlling the content of images generated by diffusion models. The process involves taking a trained diffusion model, freezing its weights, cloning some of its building blocks, and training the cloned weights with a conditioning input image. Method Architecture. Given a trained diffusion model, the ControlNet model is created by: Freezing the parameters of the original model. Cloning some of the original model blocks into a trainable copy....
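The freeze-then-clone recipe can be sketched with a toy block. All shapes and module choices here are hypothetical stand-ins for a real diffusion block; the zero-initialized projection mimics the paper's "zero convolution" idea so the control branch starts as a no-op.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for one building block of a trained diffusion model.
base_block = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# 1. Freeze the original model's parameters.
for p in base_block.parameters():
    p.requires_grad = False

# 2. Clone the block into a trainable copy.
control_block = copy.deepcopy(base_block)
for p in control_block.parameters():
    p.requires_grad = True

# 3. Zero-initialized projection, so at the start of training the
#    control branch contributes nothing and the original output is intact.
zero_proj = nn.Linear(8, 8)
nn.init.zeros_(zero_proj.weight)
nn.init.zeros_(zero_proj.bias)

def forward(x, cond):
    # Frozen path plus the conditioned, trainable path.
    return base_block(x) + zero_proj(control_block(x + cond))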

March 2, 2024 · 2 min · 316 words

[Summary] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

TL;DR The process of video editing can be time-consuming and laborious. Many diffusion-based models for videos either fail to preserve temporal consistency or require significant resources. To address it, the “RAVE” method incorporates a clever trick: it takes video frames and combines them to a “grid image”. Then, the grid image is fed to a diffusion model (+ControlNet) to produce an edited version of the grid image. Reconstructing the video from the edited grid image results in a consistent edited temporal video....
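The grid trick above amounts to tiling frames into one image and inverting the tiling after editing. A minimal sketch with toy shapes (function names and the 2×2 layout are illustrative, not RAVE's actual implementation, which also shuffles noise across grids):

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile video frames (n, h, w, c) into one big 'grid image'."""
    n, h, w, c = frames.shape
    assert n == rows * cols
    grid = frames.reshape(rows, cols, h, w, c)
    # Interleave row/height and col/width axes, then flatten to 2D layout.
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)

def grid_to_frames(grid, rows, cols, h, w):
    """Invert the tiling to recover the individual (edited) frames."""
    c = grid.shape[-1]
    frames = grid.reshape(rows, h, cols, w, c)
    return frames.transpose(0, 2, 1, 3, 4).reshape(rows * cols, h, w, c)
```

Editing all frames inside one image lets the diffusion model attend to them jointly, which is what keeps the reconstructed video temporally consistent.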

January 6, 2024 · 2 min · 422 words

[Summary] Break-A-Scene: Extracting Multiple Concepts from a Single Image

TL;DR Fine-tuning a diffusion model on a single image to generate images conditioned on user-provided concepts. Problem statement Diffusion models cannot generate new images of user-provided concepts out of the box. Methods that enable this capability (e.g., DreamBooth) require several input images containing the desired concept. Method The method consists of two phases. First, freezing the model weights and optimizing handles to reconstruct the input image. This is done with a large learning rate so as not to harm the model's generalization....
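The first phase, freezing the model and optimizing only the handle embeddings, can be sketched as below. Everything here is a hypothetical toy: a linear layer stands in for the frozen diffusion model, a single vector stands in for a handle token embedding, and plain MSE stands in for the reconstruction objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the frozen diffusion model.
model = nn.Linear(4, 4)
for p in model.parameters():
    p.requires_grad = False
w0 = model.weight.clone()               # snapshot to verify weights stay fixed

handle = nn.Parameter(torch.zeros(4))   # learnable concept "handle" embedding
target = torch.ones(4)                  # stand-in for the input image

init_loss = ((model(handle) - target) ** 2).mean().item()

# A large learning rate is safe here: only the handle is updated,
# so the frozen model's generalization cannot be harmed.
opt = torch.optim.SGD([handle], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss = ((model(handle) - target) ** 2).mean()
    loss.backward()
    opt.step()
final_loss = loss.item()
```

Only after this phase does the method (per the summary's two-phase structure) move on to touching the model weights themselves.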

July 21, 2023 · 2 min · 340 words