[Summary] Vision Language Models are Blind

TL;DR: The recent trend is to equip Large Language Models with vision capabilities, creating Vision Language Models (VLMs). However, it is unclear how well VLMs perform on simple vision tasks. This paper introduces "BlindTest", a benchmark of 7 simple tasks, such as identifying overlapping circles, intersecting lines, and circled letters. The results show that VLMs achieve only 58.57% accuracy on average, far from the expected human accuracy of 100%.

Task example (figure)

The paper aims to investigate how VLMs perceive simple images composed of basic geometric shapes....
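To make the task format concrete, here is a minimal, illustrative sketch (not the paper's actual generation code) of a BlindTest-style item: it draws two circles with matplotlib, saves the image, and records the ground-truth answer to a yes/no overlap question. Function and file names are invented for the example.

```python
# Illustrative sketch of a BlindTest-style task: two circles that may or may not overlap.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def make_two_circle_image(distance, radius=1.0, path="two_circles.png"):
    """Draw two circles whose centers are `distance` apart and save the image."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.add_patch(patches.Circle((0.0, 0.0), radius, fill=False, linewidth=2))
    ax.add_patch(patches.Circle((distance, 0.0), radius, fill=False, linewidth=2))
    ax.set_xlim(-2, 4)
    ax.set_ylim(-3, 3)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    # Ground truth: the circles overlap iff the center distance is below 2 * radius.
    return distance < 2 * radius

overlaps = make_two_circle_image(distance=1.5)
question = "Are the two circles overlapping? Answer yes or no."
print(question, "| ground truth:", "yes" if overlaps else "no")
```

A human answers such questions essentially perfectly, which is why the paper treats 100% as the reference accuracy when reporting the ~58.57% VLM average.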

August 17, 2024 · 2 min · 404 words