[Summary] Learning to Prompt for Vision-Language Models

TL;DR Vision-language models (such as CLIP) are frequently used as zero-shot classifiers: given a text prompt, one can compute its similarity to image embeddings. Prompt engineering can improve this zero-shot classification significantly; however, it is time-consuming. The CoOp method proposes a learnable prompt (trained with a single sample, i.e., one-shot) that matches the performance of hand-crafted prompts. Using 16 samples, they are able to outperform hand-crafted prompts by +15%....
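
A minimal sketch of the core idea: learnable context vectors prepended to frozen class-token embeddings, later fed through CLIP's frozen text encoder. The class name `LearnablePrompt` and all shapes are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_classes=10):
        super().__init__()
        # Learnable context vectors shared across classes ("X X ... X <class>").
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name token embeddings (random placeholders here).
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        # Prepend the shared context to each class token: (n_classes, n_ctx + 1, dim).
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)

prompts = LearnablePrompt()()
# Training idea: encode `prompts` with the frozen text encoder, compute cosine
# similarity with image features, and back-propagate cross-entropy into `ctx` only.
```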

March 22, 2024 · 2 min · 327 words

[Summary] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models

TL;DR ControlNet is a framework for controlling the content of images generated by diffusion models. The process involves taking a trained diffusion model, freezing its weights, cloning some of its building blocks, and training the cloned weights with a conditioning input image. Method Architecture. Given a trained diffusion model, the ControlNet model is created by: Freezing the parameters of the original model. Cloning some of the original model's blocks into a trainable copy....
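
A minimal sketch of the wiring under simplified assumptions (a single cloned block, the condition injected and read out through zero-initialized 1x1 convolutions so training starts as a no-op); `ControlledBlock` is a hypothetical name:

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution initialized to zero so the branch is a no-op at the start.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, block, channels):
        super().__init__()
        self.copy = copy.deepcopy(block)   # trainable clone (made before freezing)
        self.frozen = block                # original block, weights frozen
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.zin = zero_conv(channels)     # injects the condition
        self.zout = zero_conv(channels)    # feeds the clone's result back

    def forward(self, x, cond):
        return self.frozen(x) + self.zout(self.copy(x + self.zin(cond)))

block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), channels=8)
y = block(torch.randn(1, 8, 32, 32), cond=torch.randn(1, 8, 32, 32))
```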

March 2, 2024 · 2 min · 316 words

[Summary] RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

TL;DR Video editing can be time-consuming and laborious. Many diffusion-based video models either fail to preserve temporal consistency or require significant resources. To address this, the “RAVE” method uses a clever trick: it takes video frames and combines them into a single “grid image”. The grid image is then fed to a diffusion model (+ControlNet) to produce an edited version of it. Reconstructing the video from the edited grid image yields a temporally consistent edited video....
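
A minimal sketch of the grid trick, assuming a 2x2 grid and equal-size frames (the random shuffling of frames across denoising steps is omitted):

```python
import torch

def frames_to_grid(frames, rows=2, cols=2):
    # frames: (rows*cols, C, H, W) -> one grid image (C, rows*H, cols*W).
    c, h, w = frames.shape[1:]
    grid = frames.reshape(rows, cols, c, h, w)
    return grid.permute(2, 0, 3, 1, 4).reshape(c, rows * h, cols * w)

def grid_to_frames(grid, rows=2, cols=2):
    # Inverse: cut the (edited) grid image back into individual frames.
    c, gh, gw = grid.shape
    h, w = gh // rows, gw // cols
    frames = grid.reshape(c, rows, h, cols, w).permute(1, 3, 0, 2, 4)
    return frames.reshape(rows * cols, c, h, w)

frames = torch.rand(4, 3, 64, 64)
assert torch.equal(grid_to_frames(frames_to_grid(frames)), frames)
```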

January 6, 2024 · 2 min · 422 words

[Summary] Direct Preference Optimization (DPO)

TL;DR Direct Preference Optimization is a method for fine-tuning Large Language Models (LLMs) to better align their outputs with human preferences. It is a simpler alternative to RLHF, since it can be applied directly to the model without needing a reward model or reinforcement-learning optimization. Method The authors propose to re-parameterize the reward model of RLHF so that the optimal policy can be obtained in closed form. This makes it possible to solve the standard RLHF problem with a simple classification loss....
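
A minimal sketch of the resulting classification loss, assuming the summed log-probabilities of each response under the policy and the frozen reference model are already computed; `beta` is the usual KL-strength hyperparameter:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Each argument: summed log-probability of a response under the trainable
    # policy (pi_*) or the frozen reference model (ref_*).
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Binary classification: prefer the chosen response over the rejected one.
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```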

December 23, 2023 · 2 min · 236 words

[Concept] Reinforcement learning from human feedback (RLHF)

TL;DR Machine learning models require a loss function to tune their parameters. Designing a loss function that reflects ambiguous human values poses a challenge; e.g., it is not clear how to formulate a loss function that represents what is funny or ethical. To this end, a reward model is trained via human feedback. This reward model takes the model's output and predicts a reward score, which the model then uses to optimize its parameters....
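
A minimal sketch of a pairwise (Bradley-Terry style) loss commonly used to train such a reward model on human-ranked output pairs; the function name and scalar-reward interface are assumptions:

```python
import torch
import torch.nn.functional as F

def reward_loss(r_preferred, r_rejected):
    # r_*: scalar rewards the model predicts for the human-preferred and the
    # rejected output; the loss pushes the preferred reward higher.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

loss = reward_loss(torch.tensor([1.2]), torch.tensor([0.3]))
```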

December 9, 2023 · 2 min · 350 words

[Proof-of-Concept] DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion

TL;DR Typical diffusion models create images from input text. DreamPose, presented at ECCV 2023, extends this functionality by generating a video from an input image of a human model together with a pose sequence represented by DensePose. Problem statements Common diffusion models are able to generate images from a given text. However, they can neither produce an animated sequence nor be conditioned on an input pose sequence. Method Apply the following modifications to a diffusion model:...
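
One common way to add pose conditioning, shown here as an assumption rather than the paper's exact recipe, is to concatenate pose maps to the UNet input channel-wise, with the new weights zero-initialized so the pretrained behavior is preserved at the start of training:

```python
import torch
import torch.nn as nn

noisy_latent = torch.randn(1, 4, 64, 64)   # diffusion latent at some step t
pose_maps = torch.randn(1, 10, 64, 64)     # stacked DensePose maps (illustrative)

# Widen the UNet's input conv to accept the extra pose channels.
conv_in = nn.Conv2d(4 + 10, 320, kernel_size=3, padding=1)
with torch.no_grad():
    conv_in.weight[:, 4:] = 0.0            # extra channels start as a no-op

h = conv_in(torch.cat([noisy_latent, pose_maps], dim=1))   # (1, 320, 64, 64)
```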

November 18, 2023 · 3 min · 437 words

[Summary] CoDeF: Content Deformation Fields for Temporally Consistent Video Processing

TL;DR A new video representation consisting of (i) a canonical image that aggregates the static content and (ii) a temporal deformation field that reconstructs the video frames when applied to the canonical image. Problem statements Video processing comes at a high cost, and naively processing each frame results in poor cross-frame consistency. Method High-level objective. The proposed representation should have the following characteristics: Fitting capability for faithful video reconstruction. Semantic correctness of the canonical image to ensure the performance of image processing algorithms....
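
A minimal sketch of the representation, using `grid_sample` as an assumed warp operator; the shapes and the deformation interface are illustrative:

```python
import torch
import torch.nn.functional as F

canonical = torch.rand(1, 3, 64, 64)   # shared static content of the video

# Identity sampling grid; a frame is the canonical image warped by the
# per-timestep offsets predicted by the temporal deformation field.
identity = F.affine_grid(torch.eye(2, 3).unsqueeze(0), (1, 3, 64, 64),
                         align_corners=False)

def render_frame(offsets):
    # offsets: (1, 64, 64, 2), the deformation field evaluated at time t.
    return F.grid_sample(canonical, identity + offsets, align_corners=False)

frame_t = render_frame(torch.zeros(1, 64, 64, 2))   # zero offsets -> canonical
```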

October 27, 2023 · 2 min · 361 words

[Summary] Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

TL;DR This work enables interactive editing of a GAN-generated image by translating (“dragging”) any point in the image to a target location. Problem statements GAN-based image generation takes a noise vector and generates an image. There is a need for localized, controlled image manipulation, such as moving a region to a different location in the image. Method Given a GAN-generated image, a user provides the source coordinates (q) and the destination coordinates (p)
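
A minimal sketch of one motion-supervision step under a stub generator `G` (an assumption; the real method also tracks the point across steps and applies the loss over a neighborhood of q):

```python
import torch
import torch.nn.functional as F

def G(w):
    # Stub generator: returns (image, feature map) from a latent (illustrative).
    feat = w.view(1, 8, 4, 4).repeat(1, 1, 16, 16)   # (1, 8, 64, 64)
    return feat.mean(1, keepdim=True), feat

def motion_step(w, q, p, lr=2e-3):
    # One step: nudge the latent so the feature at q moves one unit toward p.
    # q, p: float (x, y) pixel coordinates.
    w = w.clone().requires_grad_(True)
    _, feat = G(w)
    d = (p - q) / (p - q).norm()                     # unit step from q toward p
    qi, ti = q.long(), (q + d).round().long()
    # Detaching the source feature makes the content at q "move" to q + d.
    loss = F.l1_loss(feat[0, :, ti[1], ti[0]], feat[0, :, qi[1], qi[0]].detach())
    loss.backward()
    return (w - lr * w.grad).detach()

w = motion_step(torch.randn(128), q=torch.tensor([10.0, 20.0]),
                p=torch.tensor([40.0, 20.0]))
```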

October 14, 2023 · 1 min · 206 words

[Summary] Break-A-Scene: Extracting Multiple Concepts from a Single Image

TL;DR Fine-tuning a diffusion model using a single image to generate images conditioned on user-provided concepts. Problem statements Diffusion models are not able to generate new images of user-provided concepts. Methods that enable this capability (e.g., DreamBooth) require several input images containing the desired concept. Method The method consists of two phases. First, freeze the model weights and optimize the concept handles to reconstruct the input image. This is done with a large learning rate so as not to harm the model's generalization....
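
A minimal sketch of that first phase: only the concept-handle embeddings are optimized while everything else stays frozen. The reconstruction loss here is a stand-in for the frozen model's denoising loss on the single input image:

```python
import torch
import torch.nn.functional as F

# Stand-in target for "embeddings that make the frozen model reconstruct the image".
target = torch.randn(2, 768)

def reconstruction_loss(h):
    # Placeholder for the frozen diffusion model's denoising loss.
    return F.mse_loss(h, target)

handles = torch.nn.Parameter(torch.randn(2, 768) * 0.02)  # one token per concept
opt = torch.optim.Adam([handles], lr=5e-3)  # large LR is safe: the model is frozen

for _ in range(100):
    opt.zero_grad()
    loss = reconstruction_loss(handles)
    loss.backward()
    opt.step()
```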

July 21, 2023 · 2 min · 340 words

[Summary] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

TL;DR To enable more controllable image diffusion, MultiDiffusion introduces patch-based generation with a global constraint. Problem statements Diffusion models lack user controllability, and methods that offer such control require costly fine-tuning. Method The method can be reduced to the following algorithm: At each time step t: Extract patches from the global image I_{t-1}. Execute the denoising step to generate the patches J_{i,t}. Combine the patches by averaging their pixel values to create the global image I_t. For the panorama use case: simply generate N images with overlapping regions between them....
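
A minimal sketch of one such fusion step with a dummy denoiser; the patch size, stride, and the `fuse_step` interface are assumptions:

```python
import torch

def fuse_step(image, denoise, patch=64, stride=48):
    # image: (C, H, W); denoise maps a patch to its denoised version.
    c, h, w = image.shape
    out = torch.zeros_like(image)
    count = torch.zeros(1, h, w)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            # Denoise each (overlapping) patch independently...
            out[:, y:y+patch, x:x+patch] += denoise(image[:, y:y+patch, x:x+patch])
            count[:, y:y+patch, x:x+patch] += 1
    # ...then average overlapping regions to form the next global image I_t.
    return out / count

img = torch.rand(3, 160, 160)
img = fuse_step(img, denoise=lambda p: p * 0.9)   # dummy denoiser for illustration
```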

May 19, 2023 · 1 min · 125 words