[Summary] Object Recognition as Next Token Prediction

TL;DR Models for object classification require a fixed set of pre-defined classes, which prevents the model from recognizing objects outside that set. In this paper, a visual classifier is trained to predict the most likely token of a pre-trained Large Language Model (LLM). Since LLMs are trained on extensive textual data, training the model to predict across the entire token space allows it to capture the full range of textual information. Methods The model is trained to predict a probability for each token of the pretrained LLM. Denoting $X_v$ as the visual features, $W$ as the LLM token embeddings, and $w$ as the most probable single token, the model prediction is $w = \arg\max_{w \in W} P(w \mid X_v)$. To guide the language decoder, the authors prompt it with “the objects in the image are” ($X_p$)....
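
Below is a minimal PyTorch sketch of the core idea, scoring every token in an LLM's vocabulary from visual features and taking the argmax. The toy tensor sizes, the mean-pooled decoder stand-in, and all variable names are illustrative assumptions, not the paper's implementation.

```python
# Sketch: next-token prediction over a pretrained LLM's full vocabulary
# from visual features. Sizes and the pooled "decoder" are toy stand-ins.
import torch
import torch.nn as nn

vocab_size, d_model, n_visual = 32000, 512, 16

# Frozen token-embedding matrix W of a pretrained LLM (placeholder weights here).
W = nn.Embedding(vocab_size, d_model).weight           # (vocab_size, d_model)

# Visual features X_v, e.g. from a vision encoder projected to the LLM width.
X_v = torch.randn(1, n_visual, d_model)                # (batch, tokens, d_model)

# A language decoder would consume [X_p; X_v] (prompt + visual tokens) and emit
# a hidden state for the next position; a mean-pool stands in for it here.
h = X_v.mean(dim=1)                                    # (batch, d_model)

# Logits over the entire token space via the embedding matrix, then the single
# most probable token w = argmax_w P(w | X_v).
logits = h @ W.T                                       # (batch, vocab_size)
probs = torch.softmax(logits, dim=-1)
w = probs.argmax(dim=-1)
print(w.shape, probs[0, w[0]].item())
```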

April 23, 2024 · 2 min · 267 words

[Summary] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models

TL;DR ControlNet is a framework for controlling the content of images generated by diffusion models. The process involves taking a trained diffusion model, freezing its weights, cloning some of its building blocks, and training the cloned weights with a conditioning input image. Method Architecture. Given a pretrained diffusion model, the ControlNet model is created by: Freezing the parameters of the original model. Cloning some of the original model blocks into a trainable copy....
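
The wiring pattern described above can be sketched roughly as follows; the `ControlledBlock` class, the stand-in convolutional block, and the zero-initialized 1x1 convolutions joining the two branches are illustrative assumptions, not the official ControlNet code.

```python
# Sketch: freeze an original block, clone it into a trainable copy, and join
# the two with zero-initialized convolutions so training starts from the
# original model's behavior.
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, original_block: nn.Module, channels: int):
        super().__init__()
        self.copy = copy.deepcopy(original_block)    # trainable clone
        self.original = original_block
        for p in self.original.parameters():         # freeze the original weights
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)           # injects the condition
        self.zero_out = zero_conv(channels)          # adds the copy's output back

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        out = self.original(x)
        out_copy = self.copy(x + self.zero_in(cond))
        return out + self.zero_out(out_copy)         # equals the original at init

block = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # stand-in for a UNet block
model = ControlledBlock(block, channels=64)
x, cond = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(model(x, cond).shape)
```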

March 2, 2024 · 2 min · 316 words

[Summary] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

TL;DR Video editing can be time-consuming and laborious. Many diffusion-based video models either fail to preserve temporal consistency or require significant resources. To address this, the “RAVE” method uses a clever trick: it combines video frames into a single “grid image”. The grid image is then fed to a diffusion model (+ControlNet) to produce an edited version of the grid. Reconstructing the video from the edited grid image yields a temporally consistent edited video....
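
A rough NumPy sketch of the grid trick follows; the helper names and the placeholder “edit” step are assumptions used only to illustrate how frames tile into a grid image and back, not RAVE's actual pipeline.

```python
# Sketch: tile video frames into one "grid image", edit that single image
# (placeholder for the diffusion + ControlNet step), then split it back.
import numpy as np

def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """frames: (rows*cols, H, W, C) -> grid image of shape (rows*H, cols*W, C)."""
    n, h, w, c = frames.shape
    assert n == rows * cols
    return (frames.reshape(rows, cols, h, w, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(rows * h, cols * w, c))

def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    return (grid.reshape(rows, h, cols, w, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape(rows * cols, h, w, c))

frames = np.random.rand(9, 64, 64, 3)                  # 9 frames of a short clip
grid = frames_to_grid(frames, rows=3, cols=3)          # one image seen by the model
edited_grid = grid * 0.9                               # placeholder for the diffusion edit
edited_frames = grid_to_frames(edited_grid, rows=3, cols=3)
print(grid.shape, edited_frames.shape)                 # (192, 192, 3) (9, 64, 64, 3)
```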

January 6, 2024 · 2 min · 422 words