[Summary] Object Recognition as Next Token Prediction
TL;DR Models for object classification require a fixed set of pre-defined classes which constrain the model from recognizing any object. In this paper, a visual classifier is trained to predict the most likely token of a pre-trained Large Language Model (LLM). Given that LLMs are trained on extensive textual data, training a model to predict across the entire token space allows it capture the full range of textual information. Methods The model is trained to predict the probability for each token of a pretrained LLM: Denote Xv as the visual features, W as the LLM token embeddings, and w represents the most probable single token, the model prediction is To guide the language decoder, the authors prompt it with “the objects in the image are” (Xp)....