[Summary] Unifying Generative and Dense Retrieval for Sequential Recommendation

TL;DR Traditional item retrieval methods use user and item embeddings to predict relevance via inner products, which does not scale to large catalogues. Generative models predict item indices directly but struggle with new items. This work proposes a hybrid model that combines item positions, text representations, and semantic IDs to predict both the next item embedding and a small set of likely next item IDs. Only this item subset, together with the new items, then enters the inner product with the user representation....
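A minimal sketch of that two-stage idea, under assumed interfaces (the names `generative_model.generate_topk`, `semantic_id_to_items`, and `item_embeddings` are illustrative, not from the paper):

```python
import torch

def retrieve(user_repr, history_tokens, generative_model, semantic_id_to_items,
             item_embeddings, new_item_ids, top_k=10, final_k=20):
    # 1) Generative step: decode a few likely next semantic IDs from the history.
    candidate_sids = generative_model.generate_topk(history_tokens, k=top_k)

    # 2) Expand semantic IDs to the items they index, and add new (cold-start)
    #    items that the generative model cannot emit yet.
    candidate_items = set(new_item_ids)
    for sid in candidate_sids:
        candidate_items.update(semantic_id_to_items[sid])
    candidate_items = sorted(candidate_items)

    # 3) Dense step: inner product between the user representation and only
    #    this reduced candidate set, instead of the full catalogue.
    cand_emb = item_embeddings[torch.tensor(candidate_items)]   # (C, d)
    scores = cand_emb @ user_repr                               # (C,)
    best = torch.topk(scores, k=min(final_k, len(candidate_items))).indices
    return [candidate_items[i] for i in best.tolist()]
```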

January 4, 2025 · 2 min · 367 words

[Summary] The Evolution of Multimodal Model Architectures

TL;DR Multimodal models are advancing rapidly across research and industry, and their architectures can be grouped into four types. Types A and B fuse multimodal data within the internal layers of the model: Type A relies on standard cross-attention for fusion, while Type B introduces custom-designed fusion layers. Types C and D fuse modalities at the input stage (early fusion): Type C uses modality-specific encoders without tokenization, while Type D employs a tokenizer for each modality at the input and can generate outputs in multiple modalities (any-to-any multimodal models). Models processing images, audio, or video alongside text have evolved significantly....
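A minimal sketch of the "Type A" pattern, assuming standard cross-attention in which text hidden states attend to image features (dimensions and module names are illustrative, not from any specific model):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Text tokens attend to image features inside a decoder block (Type A)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, image_feats):
        # text_hidden: (B, T_text, d), image_feats: (B, T_img, d)
        fused, _ = self.cross_attn(query=text_hidden, key=image_feats, value=image_feats)
        return self.norm(text_hidden + fused)   # residual + layer norm

# Type D, by contrast, would tokenize each modality up front and feed one
# interleaved token sequence to a single autoregressive backbone.
block = CrossAttentionFusion()
out = block(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```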

November 1, 2024 · 3 min · 427 words