[Summary] The Evolution of Multimodal Model Architectures

TL;DR Multimodal models are advancing rapidly across research and industry. Their architectures can be grouped into four types:

- Types A and B fuse multimodal data within the internal layers of the model:
  - Type A relies on standard cross-attention for fusion.
  - Type B introduces custom-designed layers for multimodal fusion.
- Types C and D fuse modalities at the input stage (early fusion):
  - Type C uses modality-specific encoders without tokenization.
  - Type D employs tokenizers for each modality at the input and can also generate multimodal outputs (any-to-any multimodal models).

Model Architecture Overview
Models processing images, audio, or video alongside text have evolved significantly....
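The Type A approach can be illustrated with a minimal sketch of single-head cross-attention: text hidden states act as queries while image features supply keys and values. This is a toy example with random matrices standing in for learned projection weights (the names `Wq`, `Wk`, `Wv`, and `cross_attention` are illustrative, not from any specific model).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, image_h, d):
    """Fuse image features into text states via single-head cross-attention.

    text_h:  (T, d_text)  text-token hidden states (queries)
    image_h: (I, d_img)   image-patch features (keys and values)
    d:       attention head dimension
    """
    rng = np.random.default_rng(0)
    # random stand-ins for the learned projection matrices
    Wq = rng.standard_normal((text_h.shape[-1], d))
    Wk = rng.standard_normal((image_h.shape[-1], d))
    Wv = rng.standard_normal((image_h.shape[-1], d))

    Q = text_h @ Wq    # queries come from the text stream
    K = image_h @ Wk   # keys come from the image stream
    V = image_h @ Wv   # values come from the image stream

    attn = softmax(Q @ K.T / np.sqrt(d))  # (T, I) attention over image patches
    return attn @ V    # (T, d) text tokens enriched with visual context
```

In a Type A model, blocks like this sit between the transformer's self-attention layers, so the text stream repeatedly attends to frozen or jointly trained vision-encoder outputs.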

November 1, 2024 · 3 min · 427 words

[Summary] Vision Language Models are Blind

TL;DR The recent trend is to equip Large Language Models (LLMs) with vision capabilities, creating Vision Language Models (VLMs). However, it’s unclear how well VLMs perform on simple vision tasks. This paper introduces “BlindTest”, a benchmark of 7 simple tasks, such as identifying overlapping circles, intersecting lines, and circled letters. The results show that VLMs achieve only 58.57% accuracy on average, far from the expected human accuracy of 100%. The paper investigates how VLMs perceive simple images composed of basic geometric shapes....

August 17, 2024 · 2 min · 404 words

CVPR 2024 Summary

Last week I attended the CVPR conference, a gathering of computer vision researchers and professionals showcasing the latest advancements in the field. Some interesting recent trends:

- Multimodal models and datasets
  - Large Language Models (LLMs) are being used to train vision models
  - Images are used to ground LLMs, reducing their hallucinations
  - Models are fed both images and videos to achieve better results
- Foundation models are a commodity
  - These models are becoming more accessible and less expensive to create
  - They are trained on multiple modalities and tasks (even for very niche tasks like hand pose estimation)
- Transformers are everywhere: while not a new trend, it’s still notable that attention mechanisms are incorporated into almost every model....

June 29, 2024 · 8 min · 1572 words