[Summary] Mitigating Hallucinations in Multimodal LLMs With Attention Causal Decoding

TL;DR: Hallucinations in multimodal LLMs fall into two categories: initial hallucinations, caused by insufficient model knowledge, and snowball hallucinations, where prior errors are reinforced for consistency. FarSight tackles both by redesigning information propagation: (i) sink tokens absorb uninformative signals to prevent downstream pollution, and (ii) attention decay grounds the model in early generation tokens, curbing the snowball effect. Motivation: Two key observations drive this work. Attention collapse: models disproportionately attend to low-information tokens (e....

February 21, 2026 · 3 min · 521 words

[Concept] Inside Transformer Attention

Attention Layer: Attention blocks are the backbone of the Transformer architecture, enabling the model to capture dependencies across the input sequence. An attention layer takes as input: a query vector \(q \in \mathbb{R}^d\); a matrix of keys \(K \in \mathbb{R}^{n \times d}\) (rows are \(k_i^\top\)); and a matrix of values \(V \in \mathbb{R}^{n \times d_v}\). In the vanilla Transformer setup, the query, key, and value come from the same token embedding \(x\), but the model is free to learn different subspaces for “asking” (queries), “addressing” (keys), and “answering” (values):...
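The single-query attention computation described above can be sketched in a few lines of NumPy (a minimal illustration with made-up dimensions, assuming the standard scaled dot-product form \(\operatorname{softmax}(Kq/\sqrt{d})^\top V\)):

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query:
    # weight_i = softmax(q . k_i / sqrt(d)), output = sum_i weight_i * v_i
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # (n,) one score per key
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()                 # softmax over the n keys
    return weights @ V                       # (d_v,) convex combination of values

# Toy example: n = 3 keys/values, d = 4, d_v = 2
rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 2))
out = attention(q, K, V)  # lies in the convex hull of the rows of V
```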

August 22, 2025 · 2 min · 418 words

[Lecture notes] Algorithms and Hardness for Attention and Kernel Density Estimation

TL;DR: Kernel Density Estimation (KDE) is a statistical technique with applications across various fields, such as estimating the distribution of a random variable and computing the attention layer in Transformers. While the standard algorithm for KDE has a quadratic time complexity, this presentation introduces two advanced techniques (the polynomial method and the Fast Multipole Method) that reduce the computation time to nearly linear in certain cases. KDE problem formulation: Inputs....
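The quadratic-time baseline mentioned above is just a pairwise sum. A minimal one-dimensional sketch with a Gaussian kernel (the specific kernel and bandwidth here are illustrative assumptions, not from the notes):

```python
import math

def kde(query, points, sigma=1.0):
    # Naive Gaussian KDE at a single query point:
    # density(q) = (1/n) * sum_x exp(-(q - x)^2 / (2 sigma^2))
    # O(n) per query, hence O(n*m) for m queries -- the quadratic
    # baseline that the polynomial method and FMM improve on.
    return sum(
        math.exp(-(query - x) ** 2 / (2 * sigma ** 2)) for x in points
    ) / len(points)

density = kde(0.0, [-1.0, 0.0, 1.0])  # highest near the data, decays away from it
```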

August 24, 2024 · 3 min · 514 words