[Summary] The Evolution of Multimodal Model Architectures
TL;DR Multimodal models are advancing rapidly across research and industry. Their architecture can be characterized into four different types. Types A and B integrate multimodal data within the internal layers of the model. Type A relies on standard cross-attention for fusion Type B introduces custom-designed layers for multimodal fusion Types C and D fuse multimodal at the input stage (early fusion) Type C uses modality-specific encoders without tokenization Type D employs tokenizers for each modality at the input and able to generate outputs with multimodalities (any-to-any multimodal models) Model Architecture Overview Models processing images, audio, or video alongside text have evolved significantly....