[Summary] Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
TL;DR Many machine learning systems deal with multimodal data, but no study has examined the different design choices across modalities. The paper surveys common “tricks” for multimodal systems and finds that the most effective techniques are: (i) basic strategies such as gradient clipping and learning-rate warmup, (ii) late fusion using pretrained unimodal encoders, (iii) auxiliary cross-modal alignment objectives, (iv) simple input-level augmentation, and (v) modality dropout and learnable embeddings for handling missing inputs....
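To make the basic strategies in (i) concrete, here is a minimal sketch in plain Python of linear learning-rate warmup and gradient clipping by global norm. The function names (`warmup_lr`, `clip_by_global_norm`) and the default values are illustrative assumptions, not the paper's implementation or settings.

```python
import math


def warmup_lr(step, base_lr=1e-3, warmup_steps=100):
    """Linear warmup: ramp the learning rate up to base_lr, then hold it.

    base_lr and warmup_steps are hypothetical defaults chosen for illustration.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        # Already within the bound (or all zeros): return an unchanged copy.
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]
```

In a training loop, one would typically clip the gradients first and then apply the update with `warmup_lr(step)` as the step size. Both tricks limit the size of early updates, which is one plausible reason they help when fine-tuning pretrained encoders.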