[Summary] Learning to Prompt for Vision-Language Models
TL;DR Vision-language models such as CLIP are frequently used as zero-shot classifiers: given a set of text prompts, one computes the similarity between their text embeddings and an image embedding and picks the closest class. Prompt engineering can improve this zero-shot classification significantly, but it is time consuming. The CoOp method proposes replacing the hand-crafted prompt with a learnable one; trained with a single sample per class (i.e. one shot), it reaches performance similar to human-crafted prompts. With 16 samples per class, it improves over human-crafted prompts by around +15%...
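
To make the idea concrete, here is a minimal sketch of CoOp-style context optimization on top of a frozen CLIP text encoder. It assumes the openai `clip` package and the "ViT-B/32" backbone; the class names, the number of context tokens, and the learning rate are illustrative choices, not the authors' reference implementation. The `logits_for` function at the end is the same similarity scoring used for zero-shot classification with hand-crafted prompts, just fed with the learned prompt embeddings instead.

```python
# Minimal CoOp-style sketch: only the context vectors `ctx` are trained;
# all CLIP weights stay frozen. Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # keep CLIP frozen

classnames = ["cat", "dog", "car"]          # toy label set (assumption)
n_ctx = 16                                  # number of learnable context tokens
ctx_dim = model.ln_final.weight.shape[0]    # text transformer width

# Learnable context vectors, shared across all classes ("unified context").
ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim, dtype=model.dtype, device=device))
nn.init.normal_(ctx, std=0.02)

# Tokenize "X X ... X classname." and later swap the "X" slots for `ctx`.
prompts = [" ".join(["X"] * n_ctx) + " " + name + "." for name in classnames]
tokenized = torch.cat([clip.tokenize(p) for p in prompts]).to(device)
with torch.no_grad():
    embedding = model.token_embedding(tokenized).type(model.dtype)

prefix = embedding[:, :1, :]          # [SOS] token embedding (fixed)
suffix = embedding[:, 1 + n_ctx:, :]  # classname, ".", [EOS], padding (fixed)


def encode_learned_prompts():
    """Run the frozen CLIP text transformer on [SOS] + ctx + classname tokens."""
    n_cls = len(classnames)
    x = torch.cat([prefix, ctx.unsqueeze(0).expand(n_cls, -1, -1), suffix], dim=1)
    x = x + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x).type(model.dtype)
    # Take the feature at the [EOS] position (highest token id) and project it.
    x = x[torch.arange(n_cls), tokenized.argmax(dim=-1)] @ model.text_projection
    return x


def logits_for(images):
    """Cosine-similarity logits between image features and the class prompts."""
    image_features = model.encode_image(images)
    text_features = encode_learned_prompts()
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return model.logit_scale.exp() * image_features @ text_features.t()


# Few-shot training: only `ctx` receives gradients (lr is a placeholder).
optimizer = torch.optim.SGD([ctx], lr=0.002)
# for images, labels in few_shot_loader:   # e.g. 1 or 16 labelled images per class
#     loss = nn.functional.cross_entropy(logits_for(images.to(device)), labels.to(device))
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

With one labelled image per class the learned context already roughly matches hand-crafted prompts such as "a photo of a {class}.", and with 16 shots per class it is where the reported ~15% average gain comes from.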