TL;DR

Multimodal models are advancing rapidly across research and industry. Their architectures can be grouped into four types.

  1. Types A and B integrate multimodal data within the internal layers of the model.
    1. Type A relies on standard cross-attention for fusion
    2. Type B introduces custom-designed layers for multimodal fusion
  2. Types C and D fuse multimodal data at the input stage (early fusion)
    1. Type C uses modality-specific encoders without tokenization
    2. Type D employs a tokenizer for each modality at the input and can generate outputs in multiple modalities (any-to-any multimodal models)

Model Architecture Overview

Models that process images, audio, or video alongside text have evolved significantly. Recent work has emphasized image-text integration across diverse vision-language tasks, producing a variety of multimodal architectures that differ mainly in how and where the modalities are fused.

Type A: Standard Cross-attention based Deep Fusion (SCDF)

  1. Uses a pre-trained large language model (LLM)
  2. Multimodal inputs (images/audio/video) pass through modality-specific encoders
  3. A resampler converts the encoder outputs into embeddings that match the requirements of the decoder layers
  4. Cross-attention layers facilitate deep fusion of multimodal data within the model’s layers (see the sketch after the examples below)

Examples of models in this category: Flamingo, OpenFlamingo, Otter (trained on the MIMIC-IT dataset on top of OpenFlamingo), MultiModal-GPT (derived from OpenFlamingo), PaLI-X, IDEFICS (open-access reproduction of Flamingo), Dolphins (based on the OpenFlamingo architecture), VL-BART, and VL-T5.
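
As a rough illustration, here is a minimal PyTorch sketch of this pattern: a hypothetical `Resampler` compresses the encoder output into a fixed number of latent embeddings, and a cross-attention block of the kind interleaved with the (often frozen) decoder layers lets text hidden states attend to them. Module names and dimensions are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compresses a variable number of encoder features into a fixed set of latents."""
    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, seq_enc, dim) from a frozen image/audio/video encoder
        latents = self.latents.unsqueeze(0).expand(encoder_feats.size(0), -1, -1)
        out, _ = self.attn(query=latents, key=encoder_feats, value=encoder_feats)
        return out  # (batch, num_latents, dim)

class CrossAttentionFusionBlock(nn.Module):
    """Standard cross-attention inserted between the LLM's decoder layers."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_hidden: torch.Tensor, media_latents: torch.Tensor) -> torch.Tensor:
        # Text hidden states attend to the resampled media embeddings.
        attended, _ = self.cross_attn(query=self.norm(text_hidden),
                                      key=media_latents, value=media_latents)
        return text_hidden + attended  # residual connection keeps the text path intact

# Usage with toy tensors (dimensions are illustrative)
resampler = Resampler(dim=512)
fusion = CrossAttentionFusionBlock(dim=512)
media_latents = resampler(torch.randn(2, 196, 512))     # e.g. patch features from a vision encoder
fused = fusion(torch.randn(2, 32, 512), media_latents)  # hidden states of 32 text tokens
```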

Type B: Custom Layer based Deep Fusion (CLDF)

  1. Employs a pre-trained LLM
  2. Modality-specific encoders process the multimodal input
  3. Custom-designed fusion layers, such as cross-attention variants, learnable linear layers, or MLPs, integrate multimodal data within the model’s internal layers
  4. Some implementations include a learnable gating factor to control the cross-attention contribution (see the sketch after the examples below)

Examples of models in this category: LLaMA-Adapter, LLaMA-Adapter-V2, CogVLM, mPLUG-Owl2, CogAgent, InternVL, MM-Interleaved, CogCoM, InternLM-XComposer2, MoE-LLaVA, and LION.
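
A minimal sketch of the gating idea, assuming a hypothetical fusion layer that combines a learnable linear projection, cross-attention, and a zero-initialized gate. The actual custom layers differ from model to model, so this is only indicative.

```python
import torch
import torch.nn as nn

class GatedCustomFusionLayer(nn.Module):
    """Custom fusion layer: a learnable projection plus a gate initialized to zero,
    so the pretrained LLM's behavior is unchanged at the start of training."""
    def __init__(self, text_dim: int, media_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(media_dim, text_dim)            # learnable linear adapter
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))              # learnable gating factor

    def forward(self, text_hidden: torch.Tensor, media_feats: torch.Tensor) -> torch.Tensor:
        media = self.proj(media_feats)                        # (batch, seq_media, text_dim)
        attended, _ = self.cross_attn(query=text_hidden, key=media, value=media)
        # tanh(gate) == 0 at initialization, so the fused signal is phased in gradually.
        return text_hidden + torch.tanh(self.gate) * attended

# Usage with toy tensors (dimensions are illustrative)
layer = GatedCustomFusionLayer(text_dim=512, media_dim=1024)
out = layer(torch.randn(2, 32, 512), torch.randn(2, 196, 1024))
```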

Type C: Non-Tokenized Early Fusion (NTEF)

This approach stands out as the most widely adopted multimodal model architecture.

  1. A pretrained LLM acts as the decoder without architecture modifications.
  2. Modality-specific encoders are employed.
  3. Multimodal fusion occurs at the input stage (no tokenization; continuous, non-discrete embeddings are fed to the model), as sketched below.
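
A minimal sketch of this input-stage fusion, assuming hypothetical `vision_encoder`, `projector`, and `text_embedding` components: continuous encoder features are projected into the LLM's embedding space and prepended to the text embeddings, with no discrete tokenization of the image.

```python
import torch
import torch.nn as nn

# Hypothetical components; dimensions are illustrative.
vision_encoder = nn.Identity()              # stand-in for a frozen vision encoder returning (B, 196, 1024)
projector = nn.Sequential(                  # projection layer(s) for vision-language alignment
    nn.Linear(1024, 512), nn.GELU(), nn.Linear(512, 512)
)
text_embedding = nn.Embedding(32000, 512)   # the LLM's own token embedding table

def build_llm_inputs(pixel_feats: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    """Fuse at the input stage: the image is never tokenized, only embedded and projected."""
    image_embeds = projector(vision_encoder(pixel_feats))    # (B, 196, 512)
    text_embeds = text_embedding(text_ids)                   # (B, T, 512)
    # Prepend image embeddings to the text sequence; the unmodified LLM decoder
    # then processes the concatenated sequence as ordinary input embeddings.
    return torch.cat([image_embeds, text_embeds], dim=1)     # (B, 196 + T, 512)

inputs_embeds = build_llm_inputs(torch.randn(2, 196, 1024), torch.randint(0, 32000, (2, 16)))
```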

Training procedure:

  1. Pretraining: freeze the LLM and the encoder; train only the projection layer(s) for vision-language alignment.
  2. Instruction and alignment tuning: train the projection layer(s) and the LLM on multimodal tasks (see the sketch below).
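
A short sketch of how the two stages might be configured, reusing the hypothetical `vision_encoder`, `projector`, and `llm` modules from above. The stage names and helper function are illustrative, not an actual training recipe from any specific model.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: str, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module) -> None:
    if stage == "pretraining":
        # Stage 1: freeze the LLM and the encoder; train only the projection layer(s).
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif stage == "instruction_tuning":
        # Stage 2: train the projection layer(s) and the LLM on multimodal tasks.
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)
```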

Examples of models in this category: LLaVA, PaLM-E, MiniGPT-v2, BLIP-2, MiniGPT-4, Video-LLaMA, Idefics2, and MM1.

Type D: Tokenized Early Fusion (TEF)

Models of this type are trained autoregressively to generate text tokens along with tokens of other modalities (image, audio, and video); a minimal sketch follows the examples below.

  1. Uses a pretrained LLM as the decoder
  2. Modality-specific encoders are applied
  3. Tokenizes all modalities at the input (discrete tokens)
  4. Capable of generating outputs in multiple modalities

Examples of models in this category: LaVIT, TEAL, CM3Leon, VL-GPT, Unicode, SEED, 4M, Unified-IO, and Unified-IO-2.
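
A minimal sketch of the tokenized early fusion idea, assuming a hypothetical VQ-style image tokenizer with its own codebook: image codes are offset into a shared vocabulary so a single decoder can autoregressively produce both text and image tokens, and generated image tokens are mapped back to codebook indices for pixel decoding. Vocabulary sizes and helper names are illustrative.

```python
import torch

TEXT_VOCAB = 32000
IMAGE_CODEBOOK = 8192           # hypothetical VQ tokenizer codebook size
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK

def to_unified_ids(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Map image codebook indices into the shared vocabulary and append them to the text."""
    image_ids = image_codes + TEXT_VOCAB           # offset so the two ranges don't collide
    return torch.cat([text_ids, image_ids], dim=1)

def split_generated(ids: torch.Tensor):
    """After autoregressive generation, route tokens back to the right decoder:
    text ids go to the text detokenizer, image ids to the VQ decoder."""
    is_image = ids >= TEXT_VOCAB
    text_ids = ids[~is_image]
    image_codes = ids[is_image] - TEXT_VOCAB       # back to codebook indices -> pixels
    return text_ids, image_codes

# Toy example: 8 text tokens followed by a 16-code image patch grid.
unified = to_unified_ids(torch.randint(0, TEXT_VOCAB, (1, 8)),
                         torch.randint(0, IMAGE_CODEBOOK, (1, 16)))
text_ids, image_codes = split_generated(unified[0])
```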

Resource

arXiv