How Transformers Learn Order: Absolute, Relative, and Rotary Positions

Transformers process tokens in parallel and have no built‑in sense of order. Positional encodings inject information about where each token appears in the sequence. The basic idea: each position gets a vector, and that vector is added to the token embedding before entering the transformer. This lets the model distinguish sequences like “dog bites man” from “man bites dog” and learn how order affects meaning. Over time, several approaches to positional encoding have emerged, ranging from fixed sinusoidal schemes to fully learned embeddings, relative encodings, and rotary methods, each with different tradeoffs in flexibility, inductive bias, and length generalization....
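
As a concrete illustration of that idea, here is a minimal NumPy sketch (mine, not from the post) of the fixed sinusoidal scheme; it assumes an even d_model:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position encodings; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# The vector for each position is added to the token embedding:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```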

November 15, 2025 · 5 min · 1063 words

From DETR to RF-DETR: The Evolution of End-to-End Object Detection

TL;DR Object detection has shifted from heavy, hand-engineered pipelines based on anchors and heuristics to end-to-end transformer architectures that learn object localization and classification jointly. This progression (from DETR in 2020 to RF-DETR in 2025) has reduced post-processing, improved training stability, and brought real-time inference within reach. DETR: End-to-End Object Detection with Transformers (2020) DEtection TRansformer (DETR) introduced a simple yet novel idea: formulate object detection as a direct set prediction problem solved with transformers....
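
For intuition about "direct set prediction," here is a toy sketch (mine) of the bipartite matching step DETR uses to pair predictions with ground-truth objects; the real cost matrix mixes classification and box terms, whereas this one is random:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((5, 3))   # 5 predicted boxes vs. 3 ground-truth objects

# Hungarian matching: each ground-truth object is assigned to exactly
# one prediction; the remaining predictions learn to output "no object".
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))
```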

October 31, 2025 · 4 min · 756 words

[Summary] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

TL;DR Grounding DINO is an open-set object detector that integrates natural language supervision into the DETR-style DINO framework: instead of being limited to a fixed set of classes, it allows specifying text prompts (e.g., zebra, traffic light) and finding those objects within images at inference time. The model achieves this by coupling image and text representations throughout its architecture using cross-modality attention and language-conditioned query mechanisms. Motivation Closed‑set detectors are limited to a fixed label list and cannot recognize unseen categories without new annotations and retraining....

October 13, 2025 · 3 min · 584 words

[Summary] DINOv3: Self-Supervised Vision Transformers at Scale

TL;DR The DINO series advances self-supervised learning for vision transformers through iterative architectural and data refinements. DINOv1 introduces student-teacher distillation on ImageNet-1k. DINOv2 scales to 142M curated images with patch-level objectives. DINOv3 reaches 1.7B Instagram images, adding register tokens, a new Gram-matrix-based loss, and a custom 7B-parameter ViT, achieving state-of-the-art performance on dense prediction tasks (like instance segmentation) while keeping the backbone frozen. Motivation Supervised pretraining on ImageNet has dominated vision models, but manually annotating large datasets is expensive and constrains representation quality to label granularity....
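
As a reference point for the student-teacher distillation mentioned above, here is a minimal sketch of the DINOv1-style objective (my code; the temperatures and centering follow the original recipe, but the names are illustrative):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened teacher and student distributions."""
    teacher = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    log_student = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher * log_student).sum(dim=-1).mean()

# The teacher's weights are an exponential moving average of the student's;
# gradients flow only through the student (the teacher output is detached).
```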

October 11, 2025 · 5 min · 871 words

[Summary] Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

TL;DR Many machine learning systems deal with multimodal data, yet there has been no systematic study of design choices across modalities. The paper surveys common “tricks” for multimodal systems and finds the most effective techniques to be: (i) basic strategies such as gradient clipping and learning-rate warmup, (ii) late fusion using pretrained unimodal encoders, (iii) auxiliary cross-modal alignment objectives, (iv) simple input-level augmentation, and (v) modality dropout and learnable embeddings for handling missing inputs (sketched below)....
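
A sketch of trick (v), modality dropout with learnable placeholder embeddings; the class and argument names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """Randomly drop a whole modality during training and substitute a
    learnable placeholder, so the model tolerates missing inputs."""
    def __init__(self, dims: dict, p: float = 0.2):
        super().__init__()
        self.p = p
        self.placeholders = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d)) for m, d in dims.items()})

    def forward(self, feats: dict) -> dict:
        if not self.training:
            return feats
        return {m: self.placeholders[m].expand_as(x)
                if torch.rand(()) < self.p else x
                for m, x in feats.items()}

layer = ModalityDropout({"image": 512, "text": 768})
layer.train()
out = layer({"image": torch.randn(4, 512), "text": torch.randn(4, 768)})
```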

September 9, 2025 · 3 min · 626 words

[Concept] Inside Transformer Attention

Attention Layer Attention blocks are the backbone of the Transformer architecture, enabling the model to capture dependencies across the input sequence. An attention layer takes as input: a query vector \(q \in \mathbb{R}^d\), a matrix of keys \(K \in \mathbb{R}^{n \times d}\) (rows are \(k_i^\top\)), and a matrix of values \(V \in \mathbb{R}^{n \times d_v}\). In the vanilla Transformer setup, the query, key, and value come from the same token embedding \(x\), but the model is free to learn different subspaces for “asking” (queries), “addressing” (keys), and “answering” (values):...
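
In code, the standard scaled dot-product form of that layer looks like this (a NumPy sketch for a single query, following the notation above):

```python
import numpy as np

def attention(q, K, V):
    """q: (d,), K: (n, d), V: (n, d_v) -> output: (d_v,)"""
    scores = K @ q / np.sqrt(q.shape[-1])    # similarity of q to each key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # convex combination of values
```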

August 22, 2025 · 2 min · 418 words

[Summary] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

TL;DR Machine learning model evaluations commonly report only the “highest number,” often without any measure of statistical uncertainty. This creates misleading comparisons, especially when differences between models are small. The paper surveys methods for adding statistical error bars to evals, covering independent and clustered questions, paired model comparisons, and power analysis. These tools help quantify uncertainty and avoid overconfident claims about which model is better. Motivation LLM evals often treat the top score as definitive....
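
For the simplest case the paper covers (independent questions), the error bar is just the standard error of the mean; a minimal sketch:

```python
import numpy as np

def accuracy_ci(correct, z: float = 1.96):
    """95% CI for mean accuracy over independent eval questions."""
    correct = np.asarray(correct, dtype=float)  # 1.0 = right, 0.0 = wrong
    acc = correct.mean()
    sem = correct.std(ddof=1) / np.sqrt(len(correct))
    return acc, (acc - z * sem, acc + z * sem)

# Example: 1000 questions, ~87% answered correctly.
rng = np.random.default_rng(0)
print(accuracy_ci(rng.random(1000) < 0.87))
```

Clustered questions and paired comparisons refine this estimate but keep the same basic shape.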

August 20, 2025 · 5 min · 907 words

[Summary] MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

TL;DR Recent advancements in vector retrieval demonstrate that mapping queries and documents to multiple vectors and performing multi-vector retrieval surpasses common single-vector retrieval methods. However, multi-vector retrieval is computationally intensive because its similarity measure (Chamfer similarity) is not a single inner product. MUVERA is a method designed to accelerate multi-vector retrieval: it converts sets of vectors into single Fixed Dimensional Encodings (FDEs). The inner product of two FDEs approximates the Chamfer similarity score, allowing standard, highly optimized Maximum Inner Product Search (MIPS) solvers to generate a candidate list, followed by exact similarity calculation on this small set of candidates for precise ranking....
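
For reference, Chamfer similarity between two vector sets is simple to state but is not a single inner product, which is exactly what makes it awkward for MIPS solvers (sketch below, my code):

```python
import numpy as np

def chamfer_similarity(Q: np.ndarray, D: np.ndarray) -> float:
    """Q: (m, d) query vectors, D: (n, d) document vectors.
    Each query vector is credited with its best-matching document vector."""
    sims = Q @ D.T                 # (m, n) pairwise inner products
    return sims.max(axis=1).sum()  # max over documents, summed over queries
```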

August 11, 2025 · 3 min · 543 words

[Summary] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective

TL;DR Popular methods for chain‑of‑thought (CoT) reasoning (e.g., supervised fine‑tuning, Tree‑of‑Thoughts) face three challenges: (i) distribution drift, where small mistakes spiral with no recovery mechanism, (ii) missing search structure, with no built-in exploration or backtracking, and (iii) explosive computational cost. The proposed Diligent Learner models reasoning as depth-first search guided by a validator: it builds reasoning paths step by step, checks each step for correctness, and backtracks when needed....
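
A schematic of that validator-guided depth-first search (purely illustrative; the function names are mine, not the paper's API):

```python
def search(state, expand, is_valid, is_goal, depth=0, max_depth=8):
    """Extend the reasoning path one step at a time, validate each step,
    and backtrack on failure."""
    if is_goal(state):
        return [state]
    if depth >= max_depth:
        return None                  # depth cap bounds the compute
    for step in expand(state):       # candidate next reasoning steps
        if not is_valid(step):       # validator rejects bad steps early
            continue
        path = search(step, expand, is_valid, is_goal, depth + 1, max_depth)
        if path is not None:
            return [state] + path    # valid path found; unwind
    return None                      # dead end -> backtrack
```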

August 7, 2025 · 5 min · 875 words

[Summary] Ada-R1: Hybrid CoT via Bi-Level Adaptive Reasoning Optimization

TL;DR Chain-of-Thought (CoT) enables large language models (LLMs) to solve complex tasks by generating intermediate reasoning steps. The Ada-R1 approach fine-tunes a model to prefer Short-CoT over Long-CoT based on problem complexity, training it to minimize reasoning length while preserving accuracy. This reduces average reasoning length by over 50%, substantially lowering inference cost while maintaining accuracy across five mathematical reasoning benchmarks. Background CoT prompting decomposes complex tasks into intermediate reasoning steps....

May 1, 2025 · 2 min · 372 words

[Summary] LettuceDetect: A Hallucination Detection Framework for RAG Applications

TL;DR Retrieval-Augmented Generation (RAG) grounds large-language-model (LLM) answers in external documents, yet hallucinations persist. Existing detectors either rely on expensive LLM-as-a-judge setups or on encoder classifiers that truncate context and lose evidence. LettuceDetect introduces a long-context, token-level classifier built on ModernBERT. It surpasses prior encoder baselines while remaining markedly more efficient than LLM-based judges. Background LLMs hallucinate when generated claims are not supported by retrieved context. Encoder detectors shorten inputs to fit model context limits, reducing recall, whereas generative judges process full context but incur high latency and cost....
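
The general shape of token-level detection with a long-context encoder looks like the sketch below; the checkpoint ID is a placeholder (substitute the released LettuceDetect weights), and the label semantics depend on the actual model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

ckpt = "org/lettucedetect-modernbert"  # placeholder ID, not a real checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt)

context = "The Eiffel Tower is 330 metres tall."
answer = "The tower is 500 metres tall."
inputs = tok(context, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels)
labels = logits.argmax(-1)[0]           # per-token supported vs. hallucinated
```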

April 25, 2025 · 2 min · 220 words

[Summary] On the Biology of a Large Language Model

TL;DR Large Language Models (LLMs) are often perceived as “black boxes,” making their decision-making and reasoning processes difficult to interpret. A novel method simplifies these complex models by replacing internal nonlinear layers with linear modules tailored to clearly understandable features. This approach reveals structured reasoning, planning behaviors, and even hidden intentions within the model’s computations. Method Interpreting LLMs is challenging because individual neurons often represent multiple, unrelated concepts simultaneously (polysemanticity). To address this, the approach creates a simplified “replacement model”, preserving most of the original model’s performance while enhancing interpretability through these steps:...

April 12, 2025 · 2 min · 367 words

[Summary] VGGT: Visual Geometry Grounded Transformer

TL;DR Traditional 3D reconstruction relied on iterative visual-geometry optimization (e.g., Bundle Adjustment). Recent work explored integrating machine learning via differentiable Bundle Adjustment, but remained slow and limited. VGGT (Visual Geometry Grounded Transformer) is a large feed-forward transformer that predicts all key 3D scene attributes—camera parameters, depth maps, point maps, and 3D point tracks—directly from one or many images in a single forward pass. It removes the need for geometry processing, achieves state-of-the-art results in multiple benchmarks, and runs in under a second....

April 5, 2025 · 3 min · 479 words

[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR Transformer models often map multiple concepts to the same neuron, making it unclear what features they learn. This work makes inner representations interpretable by using a sparse autoencoder layer to map neurons to concepts. This method extracts relatively monosemantic concepts, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features. Method A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....
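
The core of the method fits in a few lines; a minimal PyTorch sketch of a sparse autoencoder over model activations (sizes illustrative, echoing the 512-neuron setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_in-dim activations into an overcomplete feature basis."""
    def __init__(self, d_in: int = 512, d_feat: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_in, d_feat)
        self.dec = nn.Linear(d_feat, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse, non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
x = torch.randn(8, 512)              # a batch of MLP activations
recon, feats = sae(x)
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().mean()  # L2 + L1 sparsity
```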

March 15, 2025 · 2 min · 333 words

[Summary] Relightable Gaussian Codec Avatars

TL;DR Photorealistic head avatars are a key technology for virtual and augmented reality. However, current approaches either lack the fidelity to capture fine details (like hair) or are too slow for real-time use. Relightable Gaussian Codec Avatars proposes using (i) 3D Gaussian splatting for efficient geometry representation and (ii) a learnable radiance transfer model for appearance, including an explicit eye model. Background Image relighting is the task of showing what a scene from a source image would look like if illuminated differently....

February 28, 2025 · 3 min · 606 words

[Summary] Training Vision Transformers with Only 2040 Images

TL;DR Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) with sufficient data but are data-hungry, limiting their use with small datasets. The authors propose a method to train ViTs with limited data by pre-training with label smoothing, lower-resolution images, and parametric instance discrimination, followed by fine-tuning on the target task. Method Training a Vision Transformer on small datasets involves two steps. The first is self-supervised pretraining with parametric instance discrimination: classify each image as its own class (sketched below)....
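
Parametric instance discrimination is lighter than it sounds: with N training images, attach an N-way classifier head and treat every image as its own class (illustrative sketch, my names):

```python
import torch
import torch.nn as nn

N, d = 2040, 384                     # dataset size, embedding dim (illustrative)
head = nn.Linear(d, N)               # one "class" per training image

features = torch.randn(16, d)        # backbone outputs for a batch of 16 images
image_ids = torch.randint(0, N, (16,))  # each image's own index is its label
loss = nn.functional.cross_entropy(head(features), image_ids)
```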

February 15, 2025 · 2 min · 217 words

[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance worsens due to dimensional collapse of representations: representations come to lie in a narrow cone in feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step on that term with respect to the features, they derive the ContraNorm layer, which leads to a more uniform distribution and prevents dimensional collapse....
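
From my reading, the basic variant of the layer amounts to subtracting a step along the similarity-weighted features before normalizing (a hedged sketch; the paper discusses several variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNorm(nn.Module):
    """Subtract a step along the collapse-inducing direction, then normalize."""
    def __init__(self, dim: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (batch, tokens, dim)
        sim = F.softmax(x @ x.transpose(-2, -1), dim=-1)   # token-token similarity
        return self.norm(x - self.scale * sim @ x)
```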

February 1, 2025 · 2 min · 400 words

[Summary] ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR Large Language Models (LLMs) often suffer from hallucinations. Two common mitigation strategies are Chain of Thought (CoT), where the LLM is prompted to show its step-by-step reasoning, and Act, where LLMs use external tools to ground their answers in reliable databases. However, CoT relies on the model’s internal representations, limiting its ability to reason reactively or update its knowledge. ReAct is a prompting method that combines CoT with action plan generation using external tools....
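
A minimal ReAct-style loop, to make the Thought → Action → Observation cycle concrete (the prompt format and helper names are illustrative, not the paper's exact prompts):

```python
import re

def parse_action(text: str):
    """Extract 'Action: Tool[argument]' from a model response."""
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", text)
    return (m.group(1), m.group(2)) if m else (None, None)

def react_loop(llm, tools, question, max_steps=5):
    """Alternate reasoning and tool use until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # model emits Thought + Action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        tool, arg = parse_action(step)
        if tool in tools:                      # ground the next thought in a tool call
            transcript += f"Observation: {tools[tool](arg)}\n"
    return None
```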

January 17, 2025 · 1 min · 203 words

[Summary] Unifying Generative and Dense Retrieval for Sequential Recommendation

TL;DR Traditional item retrieval methods use user and item embeddings to predict relevance via inner-product computation, which is not scalable for large systems. Generative models predict item indices directly but struggle with new items. This work proposes a hybrid model that combines item positions, text representations, and semantic IDs to predict both the next-item embedding and several likely next-item IDs. Only this candidate subset, along with new items, is then scored by inner product against the user representation....
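
The scoring protocol is easy to picture: the generative head proposes a few candidate IDs, and the inner product runs only over that subset plus new items (illustrative sketch, my names):

```python
import numpy as np

rng = np.random.default_rng(0)
item_embs = rng.standard_normal((100_000, 64))  # full catalog embeddings
user_emb = rng.standard_normal(64)              # from the sequential model

generated_ids = [12, 874, 3021]      # IDs decoded by the generative head
new_item_ids = [99_998, 99_999]      # cold-start items the generator can't emit
candidates = generated_ids + new_item_ids

scores = item_embs[candidates] @ user_emb       # inner product over the subset only
ranked = [candidates[i] for i in np.argsort(-scores)]
```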

January 4, 2025 · 2 min · 367 words

[Summary] The Evolution of Multimodal Model Architectures

TL;DR Multimodal models are advancing rapidly across research and industry. Their architectures can be grouped into four types. Types A and B integrate multimodal data within the internal layers of the model: Type A relies on standard cross-attention for fusion, while Type B introduces custom-designed layers for multimodal fusion. Types C and D fuse modalities at the input stage (early fusion): Type C uses modality-specific encoders without tokenization, while Type D employs tokenizers for each modality at the input and can generate outputs in multiple modalities (any-to-any multimodal models). Model Architecture Overview Models processing images, audio, or video alongside text have evolved significantly....

November 1, 2024 · 3 min · 427 words