[Summary] LettuceDetect: A Hallucination Detection Framework for RAG Applications

TL;DR Retrieval-Augmented Generation (RAG) grounds large-language-model (LLM) answers in external documents, yet hallucinations persist. Existing detectors either rely on expensive LLM-as-a-judge methods or on encoder classifiers that truncate context and lose evidence. LettuceDetect introduces a long-context, token-level classifier built on ModernBERT. It surpasses prior encoder baselines while remaining markedly more efficient than LLM-based judges. Background LLMs hallucinate when generated claims are not supported by the retrieved context. Encoder detectors shorten inputs to fit model limits (context size), reducing recall, whereas generative judges process the full context but incur high latency and cost....
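
A minimal sketch of the token-level classification idea, assuming a Hugging Face token-classification head on top of the ModernBERT backbone; the model id, label convention, and example text are illustrative, not the framework's actual API, and a freshly initialized head is untrained.

```python
# Sketch: classify each answer token as supported (0) or hallucinated (1)
# given the retrieved context. Model id and label convention are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "answerdotai/ModernBERT-base"  # assumed long-context encoder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

context = "The Eiffel Tower was completed in 1889 and is 330 metres tall."
answer = "The Eiffel Tower was completed in 1850."

# Encode context and answer as a pair; a long context window avoids truncating evidence.
inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits           # (1, seq_len, 2)
labels = logits.argmax(dim=-1)[0]             # 1 = hallucinated token (assumed convention)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
flagged = [t for t, l in zip(tokens, labels) if l == 1]
print(flagged)
```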

April 25, 2025 · 2 min · 220 words

[Summary] On the Biology of a Large Language Model

TL;DR Large Language Models (LLMs) are often perceived as “black boxes,” making their decision-making and reasoning processes difficult to interpret. A novel method simplifies these complex models by replacing internal nonlinear layers with linear modules tailored to clearly understandable features. This approach reveals structured reasoning, planning behaviors, and even hidden intentions within the model’s computations. Method Interpreting LLMs is challenging because individual neurons often represent multiple, unrelated concepts simultaneously (polysemanticity). To address this, the approach creates a simplified “replacement model”, preserving most of the original model’s performance while enhancing interpretability through these steps:...

April 12, 2025 · 2 min · 367 words

[Summary] VGGT: Visual Geometry Grounded Transformer

TL;DR Traditional 3D reconstruction relied on iterative visual-geometry optimization (e.g., Bundle Adjustment). Recent work explored integrating machine learning via differentiable Bundle Adjustment, but remained slow and limited. VGGT (Visual Geometry Grounded Transformer) is a large feed-forward transformer that predicts all key 3D scene attributes—camera parameters, depth maps, point maps, and 3D point tracks—directly from one or many images in a single forward pass. It removes the need for geometry processing, achieves state-of-the-art results on multiple benchmarks, and runs in under a second....

April 5, 2025 · 3 min · 479 words

[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR Transformer models often map multiple concepts to the same neuron, making it unclear what features they learn. This work makes inner representations interpretable by using a sparse autoencoder layer to map neurons to concepts. This method extracts relatively monosemantic concepts, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features. Method A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....
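
A minimal sketch of the dictionary-learning setup, assuming a PyTorch sparse autoencoder with an L1 penalty on the feature activations; dimensions and the sparsity coefficient are illustrative, not the paper's values.

```python
# Sparse autoencoder over MLP activations: expand into a larger dictionary of
# features, keep activations sparse, and reconstruct the original activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))       # sparse feature activations
        x_hat = self.decoder(f)               # reconstruction of the activations
        return x_hat, f

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                   # stand-in for transformer MLP activations
x_hat, f = sae(acts)
l1_coeff = 1e-3                               # sparsity strength (assumed value)
loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
```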

March 15, 2025 · 2 min · 333 words

[Summary] Relightable Gaussian Codec Avatars

TL;DR Photorealistic head avatars are a required technology for virtual and augmented reality. However, current approaches either lack the fidelity to capture fine details (like hair) or are too slow for real-time use. Relightable Gaussian Codec Avatars proposes using (i) 3D Gaussian splatting for efficient geometry representation and (ii) a learnable radiance transfer model for appearance, including an explicit eye model. Background Image relighting is the task of showing what a scene from a source image would look like if illuminated differently....

February 28, 2025 · 3 min · 606 words

[Summary] Training Vision Transformers with Only 2040 Images

TL;DR Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) given sufficient data but are data-hungry, limiting their use with small datasets. The authors propose a method to train ViTs with limited data by pre-training with label smoothing, lower-resolution images, and parametric instance discrimination, followed by fine-tuning on the target task. Method Training a Vision Transformer on small datasets involves two steps. Self-supervised pretraining with parametric instance discrimination: classify each image as its own class....
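
A rough sketch of the parametric instance discrimination objective in PyTorch; the backbone, image size, and label-smoothing value are placeholders rather than the paper's configuration.

```python
# Parametric instance discrimination: each of the N training images is treated
# as its own class, and a classifier head predicts which image an augmented
# view came from (cross-entropy with label smoothing).
import torch
import torch.nn as nn

num_images = 2040
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 384))  # stand-in for a small ViT
instance_head = nn.Linear(384, num_images)            # one "class" per training image
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

images = torch.randn(32, 3, 112, 112)                  # low-resolution crops
instance_ids = torch.randint(0, num_images, (32,))     # index of each image in the dataset
logits = instance_head(backbone(images))
loss = criterion(logits, instance_ids)
```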

February 15, 2025 · 2 min · 217 words

[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance worsens due to dimensional collapse of representations, which end up lying in a narrow cone in the feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step with respect to the features, they derive the ContraNorm layer, which leads to a more uniform distribution and prevents dimensional collapse....
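
A loose sketch of a ContraNorm-style layer in PyTorch; the scaling factor and the exact update only approximate the paper's formulation.

```python
# ContraNorm-style layer: take a gradient-like step that pushes token
# representations away from their most similar neighbours (the uniformity term
# of the contrastive loss), then normalize.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNormSketch(nn.Module):
    def __init__(self, dim, scale=0.1):
        super().__init__()
        self.scale = scale
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (batch, tokens, dim)
        x = F.normalize(x, dim=-1)
        sim = torch.softmax(x @ x.transpose(1, 2), dim=-1)  # token-token similarity
        x = x - self.scale * sim @ x                        # step away from similar tokens
        return self.norm(x)
```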

February 1, 2025 · 2 min · 400 words

[Summary] ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR Large Language Models (LLMs) often suffer from hallucinations. Two common mitigation strategies are Chain of Thought (CoT), where the LLM is prompted to show its step-by-step reasoning, and Act, where LLMs use external tools to ground their answers in reliable databases. However, CoT relies on the model’s internal representations, limiting its ability to reason reactively or update its knowledge. ReAct is a prompting method that combines CoT with action plan generation using external tools....
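
A toy ReAct-style loop; `call_llm` and `search_wikipedia` are hypothetical stubs, and the Thought/Action/Observation format only mimics the paper's prompting style.

```python
# The model alternates free-form reasoning ("Thought") with tool calls ("Action"),
# whose results are fed back into the prompt as "Observation".
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

def search_wikipedia(query: str) -> str:
    raise NotImplementedError  # plug in a real retrieval tool here

def react(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(prompt + "Thought:")
        prompt += f"Thought:{step}\n"
        if "Action: Search[" in step:
            query = step.split("Action: Search[")[1].split("]")[0]
            prompt += f"Observation: {search_wikipedia(query)}\n"
        elif "Action: Finish[" in step:
            return step.split("Action: Finish[")[1].split("]")[0]
    return prompt  # fall back to the full trace if no answer was produced
```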

January 17, 2025 · 1 min · 203 words

[Summary] Unifying Generative and Dense Retrieval for Sequential Recommendation

TL;DR Traditional item retrieval methods use user and item embeddings to predict relevance via inner-product computation, which does not scale to large systems. Generative models predict item indices directly but struggle with new items. This work proposes a hybrid model that combines item positions, text representations, and semantic IDs to predict both the next-item embedding and several candidate next-item IDs. Only this item subset, along with the new items, is then scored via inner product against the user representations....

January 4, 2025 · 2 min · 367 words

[Summary] The Evolution of Multimodal Model Architectures

TL;DR Multimodal models are advancing rapidly across research and industry. Their architectures can be characterized into four types. Types A and B integrate multimodal data within the internal layers of the model: Type A relies on standard cross-attention for fusion, while Type B introduces custom-designed layers for multimodal fusion. Types C and D fuse modalities at the input stage (early fusion): Type C uses modality-specific encoders without tokenization, while Type D employs tokenizers for each modality at the input and is able to generate multimodal outputs (any-to-any multimodal models). Model Architecture Overview Models processing images, audio, or video alongside text have evolved significantly....

November 1, 2024 · 3 min · 427 words

[Summary] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

TL;DR State-of-the-art language models are primarily decoder-only, focusing on token prediction rather than producing rich contextualized embeddings for downstream tasks. LLM2Vec introduces an unsupervised method to transform decoder-only models into encoders. This approach involves: (i) enabling bidirectional attention, (ii) training on masked token prediction, and (iii) incorporating unsupervised contrastive learning. The result is that these converted models outperform traditional encoder-only models. Background Until recently, large language models (LLMs) were predominantly based on bidirectional encoders or encoder-decoder frameworks like BERT and T5....
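
A minimal sketch of using a (converted) decoder-only model as a text encoder via mean pooling; the model id is a placeholder, and LLM2Vec's full recipe additionally flips the attention mask to bidirectional and trains with masked-token prediction and contrastive learning.

```python
# Mean-pool the hidden states of a decoder-only model to get sentence embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "meta-llama/Llama-2-7b-hf"           # placeholder decoder-only backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # decoder-only tokenizers often lack a pad token
model = AutoModel.from_pretrained(model_name)

sentences = ["retrieval-augmented generation", "dense text embeddings"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (batch, seq, dim)
mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding when pooling
embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean-pooled sentence embeddings
```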

October 18, 2024 · 2 min · 335 words

[Summary] Fine-Grained Fashion Similarity Prediction by Attribute-Specific Embedding Learning

TL;DR In the fashion domain, visually distinct products may share fine-grained attributes like sleeve length or collar shape. Traditional methods of finding similar products often overlook these details, leading to irrelevant results for the user. To address this, the authors propose a model with two branches: a global branch that processes the entire image and a local branch that takes a Region of Interest (ROI) for specific attributes, identified through a spatial attention layer in the global branch....

October 4, 2024 · 3 min · 457 words

[Lecture notes] Algorithms and Hardness for Attention and Kernel Density Estimation

TL;DR Kernel Density Estimation (KDE) is a statistical technique with applications across various fields, such as estimating the distribution of a random variable and computing the attention layer in Transformers. While the standard algorithm for KDE has a quadratic time complexity, this presentation introduces two advanced techniques (the polynomial method and the Fast Multipole Method) that reduce the computation time to nearly linear in certain cases. KDE problem formulation Inputs....
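
The naive quadratic-time KDE baseline with a Gaussian kernel, i.e., the computation that the polynomial and Fast Multipole methods accelerate; sizes and bandwidth below are arbitrary.

```python
# Naive O(n·m) kernel density estimation: average kernel value of each query
# against every data point, with a unit-bandwidth Gaussian kernel.
import numpy as np

def kde(points: np.ndarray, queries: np.ndarray) -> np.ndarray:
    # points: (n, d), queries: (m, d); returns the average kernel value per query
    diffs = queries[:, None, :] - points[None, :, :]   # (m, n, d)
    sq_dists = (diffs ** 2).sum(-1)                    # (m, n)
    return np.exp(-sq_dists).mean(axis=1)

points = np.random.randn(1000, 2)
queries = np.random.randn(5, 2)
print(kde(points, queries))
```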

August 24, 2024 · 3 min · 514 words

[Summary] Vision Language Models Are Blind

TL;DR The recent trend is to equip Large Language Models with vision capabilities, creating Vision Language Models (VLMs). However, it’s unclear how well VLMs perform on simple vision tasks. This paper introduces “BlindTest”, a benchmark of 7 simple tasks, such as identifying overlapping circles, intersecting lines, and circled letters. The results show that VLMs achieve only 58.57% accuracy on average, far from the expected human accuracy of 100%. Task example The paper aims to investigate how VLMs perceive simple images composed of basic geometric shapes....

August 17, 2024 · 2 min · 404 words

[Summary] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

TL;DR Existing diffusion models for sequence generation have two main limitations: They either generate sequences one token at a time, without the ability to steer the sampling process towards desired outcomes, or diffuse the entire sequence iteratively but are constrained to a fixed sequence length. Diffusion Forcing combines the benefits of both approaches by diffusing the entire sequence iteratively with independent per-token noise levels that are conditioned on the previous token in the sequence....

July 21, 2024 · 2 min · 387 words

CVPR 2024 Summary

Last week I attended the CVPR conference, a gathering of computer vision researchers and professionals showcasing the latest advancements in the field. Some interesting recent trends. Multimodal models and datasets: Large Language Models (LLMs) are being used to train vision models, images are used to ground LLMs and reduce their hallucinations, and models are being fed both images and videos to achieve better results. Foundation models are a commodity: they are becoming more accessible and less expensive to create, and they are trained on multiple modalities and tasks (even for very niche tasks like hand pose estimation). Transformers are everywhere: while not a new trend, it’s still notable that attention mechanisms are incorporated into almost every model....

June 29, 2024 · 8 min · 1572 words

[Lecture notes] Let's build the GPT Tokenizer

Andrej Karpathy has released a great series of in-depth, hands-on videos on building GPT models. Here are my notes from watching the “Let’s build the GPT Tokenizer” video. What are Tokens? Large Language Models (LLMs) don’t process raw text directly. They operate on tokens, the output of the tokenization process, which translates text into a sequence of tokens. Many issues of LLMs are mainly due to tokenization: LLMs are bad at simple arithmetic....
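
A quick illustration of tokenization as text-to-token-ids with the `tiktoken` package and the GPT-2 vocabulary (my own example, not from the video).

```python
# Text -> token ids -> token pieces. Numbers are split into multiple tokens,
# which is one reason LLMs struggle with simple arithmetic.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for text in ["hello world", "12345 + 67890"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", ids, pieces)
```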

June 8, 2024 · 5 min · 926 words

[Summary] Semi-supervised Learning Made Simple with Self-supervised Clustering

TL;DR In self-supervised learning there is no guarantee that representations will organize into clusters according to their semantic classes. When labels are partially available, the authors propose to replace the cluster centroids with class prototypes learned with supervision. In this way, unlabeled samples are clustered around the class prototypes, guided by the self-supervised clustering-based objective. Method The method trains a model by jointly optimizing a supervised loss on labeled data and a self-supervised loss on unlabeled data, using the same loss function (cross-entropy)....

May 14, 2024 · 2 min · 407 words

[Summary] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

TL;DR Generative Large Language Models (LLMs) are limited to generating text based on their training data, which means any extension to additional sources requires additional training. Retrieval-Augmented Generation (RAG) is a method that combines a document database with LLMs, enabling the LLM's knowledge to be updated and made more precise for specific applications. Method Building blocks The method consists of 3 building blocks. Document index. A pre-trained model was used to encode documents into embeddings to create the index....
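
A minimal RAG sketch; the paper uses DPR and BART, while here a small sentence-transformer stands in for the retriever and the generator call is left as a stub.

```python
# Embed documents once, retrieve the most relevant ones for a query by inner
# product, and stuff them into the generator's prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]
doc_index = encoder.encode(docs)                      # (num_docs, dim) document index

def retrieve(query: str, k: int = 1) -> list[str]:
    q = encoder.encode([query])[0]
    scores = doc_index @ q                            # inner-product relevance
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "When was the Eiffel Tower built?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)                                         # would be passed to the generator LLM
```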

April 29, 2024 · 2 min · 335 words

[Summary] Object Recognition as Next Token Prediction

TL;DR Models for object classification require a fixed set of pre-defined classes, which constrains the model from recognizing arbitrary objects. In this paper, a visual classifier is trained to predict the most likely token of a pre-trained Large Language Model (LLM). Given that LLMs are trained on extensive textual data, training a model to predict across the entire token space allows it to capture the full range of textual information. Methods The model is trained to predict the probability of each token of a pretrained LLM: denoting Xv as the visual features, W as the LLM token embeddings, and w as the most probable single token, the model predicts the distribution of w given Xv and W. To guide the language decoder, the authors prompt it with “the objects in the image are” (Xp)....
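
A toy illustration of scoring every LLM token from visual features via the token-embedding matrix W; sizes are made up, and the actual method prompts a full language decoder rather than using a single projection.

```python
# Score the whole vocabulary by inner product between (aligned) visual features
# and the LLM token embeddings, yielding a distribution over all tokens.
import torch

vocab_size, d_model = 32000, 512
W = torch.randn(vocab_size, d_model)   # stand-in for frozen LLM token embeddings
x_v = torch.randn(1, d_model)          # visual features already mapped to the LLM space

logits = x_v @ W.T                     # (1, vocab_size)
probs = logits.softmax(dim=-1)
top = probs.topk(5)
print(top.indices, top.values)         # most likely "object" tokens
```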

April 23, 2024 · 2 min · 267 words