[Summary] LettuceDetect: A Hallucination Detection Framework for RAG Applications

TL;DR Retrieval-Augmented Generation (RAG) grounds large language model (LLM) answers in external documents, yet hallucinations persist. Existing detectors either rely on expensive LLM-as-a-judge approaches or on encoder classifiers that truncate context and lose evidence. LettuceDetect introduces a long-context, token-level classifier built on ModernBERT. It surpasses prior encoder baselines while remaining markedly more efficient than LLM-based judges. Background LLMs hallucinate when generated claims are not supported by the retrieved context. Encoder detectors shorten inputs to fit model context limits, reducing recall, whereas generative judges process the full context but incur high latency and cost....
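
A minimal sketch of the token-level approach using the Hugging Face transformers API; the checkpoint name, label scheme, and input packing are illustrative assumptions, not the official LettuceDetect interface.

```python
# Minimal sketch of token-level hallucination detection with a long-context
# encoder. Checkpoint name, label scheme, and input packing are assumptions,
# not the official LettuceDetect interface.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "answerdotai/ModernBERT-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

# Pack retrieved context + question as the first segment, the generated answer as the second.
context = "Paris is the capital of France."
question = "What is the capital of France?"
answer = "The capital of France is Lyon."

inputs = tokenizer(context + " " + question, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, 2)

hallucination_prob = logits.softmax(-1)[0, :, 1]  # per-token P(unsupported)
flagged = hallucination_prob > 0.5                # answer tokens marked as hallucinated
```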

April 25, 2025 · 2 min · 220 words

[Summary] ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR Large Language Models (LLMs) often suffer from hallucinations. Two common mitigation strategies are Chain of Thought (CoT), where the LLM is prompted to show its step-by-step reasoning, and Act, where LLMs use external tools to ground their answers in reliable databases. However, CoT relies on the model’s internal representations, limiting its ability to reason reactively or update its knowledge. ReAct is a prompting method that combines CoT with action plan generation using external tools....
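
A minimal sketch of the ReAct loop, with hypothetical `call_llm` and `run_tool` helpers standing in for an LLM API and a tool executor (neither is from the paper): the model interleaves Thought / Action / Observation steps until it emits a final answer.

```python
# Minimal ReAct-style loop: alternate Thought -> Action -> Observation until the
# model emits a final answer. `call_llm` and `run_tool` are hypothetical
# stand-ins for an LLM API and a tool executor (e.g. a search engine).
def react_loop(question, call_llm, run_tool, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript + "Thought:")         # model continues the trace
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            action = step.split("Action:")[-1].strip()   # e.g. "Search[Apple Remote]"
            observation = run_tool(action)               # ground the next thought in tool output
            transcript += f"Observation: {observation}\n"
    return None
```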

January 17, 2025 · 1 min · 203 words

[Summary] Unifying Generative and Dense Retrieval for Sequential Recommendation

TL;DR Traditional item retrieval methods use user and item embeddings and predict relevance via inner products, which does not scale to large systems. Generative models predict item indices directly but struggle with new items. This work proposes a hybrid model that combines item positions, text representations, and semantic IDs to predict both the next-item embedding and a set of likely next-item IDs. Only this item subset, together with the new items, is then scored by inner product against the user representation....
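
A rough sketch of that scoring step, assuming a generative model has already proposed candidate item IDs and a dense user/next-item embedding is available (all names and shapes are illustrative):

```python
# Sketch of the hybrid retrieval step: dense inner-product scoring is run only
# over the items proposed by the generative model plus cold-start (new) items,
# instead of the full catalog. All names are illustrative.
import torch

def hybrid_retrieve(user_emb, item_embs, generated_ids, new_item_ids, top_k=10):
    # Restrict scoring to the generated candidates and the unseen items.
    candidate_ids = torch.unique(torch.cat([generated_ids, new_item_ids]))
    scores = item_embs[candidate_ids] @ user_emb     # inner products on the subset only
    k = min(top_k, candidate_ids.numel())
    best = scores.topk(k).indices
    return candidate_ids[best]

# Example usage with random data.
items = torch.randn(1000, 64)                        # full catalog embeddings
user = torch.randn(64)                               # predicted user / next-item embedding
gen = torch.randint(0, 900, (50,))                   # IDs proposed by the generative model
new = torch.arange(900, 1000)                        # cold-start item IDs
print(hybrid_retrieve(user, items, gen, new))
```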

January 4, 2025 · 2 min · 367 words

[Summary] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

TL;DR State-of-the-art language models are primarily decoder-only, focusing on next-token prediction rather than producing rich contextualized embeddings for downstream tasks. LLM2Vec introduces an unsupervised method to transform decoder-only models into encoders. This approach involves: (i) enabling bidirectional attention, (ii) training on masked next-token prediction, and (iii) applying unsupervised contrastive learning. The converted models outperform traditional encoder-only models. Background Until recently, language models were predominantly based on bidirectional encoders or encoder-decoder frameworks like BERT and T5....
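
As a concrete illustration of step (iii), here is a sketch of a SimCSE-style unsupervised contrastive loss, where the same batch of sentences is encoded twice under different dropout masks; `encode` is a hypothetical mean-pooling wrapper around the adapted decoder, not the llm2vec package API.

```python
# Sketch of the unsupervised contrastive step (SimCSE-style): two encodings of
# the same sentence (different dropout masks) form a positive pair; the other
# sentences in the batch serve as negatives. `encode` is a hypothetical wrapper
# that mean-pools hidden states of the bidirectional, MNTP-tuned decoder.
import torch
import torch.nn.functional as F

def unsupervised_contrastive_loss(encode, sentences, temperature=0.05):
    z1 = encode(sentences)     # (B, d), dropout mask A
    z2 = encode(sentences)     # (B, d), dropout mask B
    sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(z1.size(0), device=z1.device)   # positive pair = same index
    return F.cross_entropy(sims, labels)
```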

October 18, 2024 · 2 min · 335 words

CVPR 2024 Summary

Last week I attended the CVPR conference, a gathering of computer vision researchers and professionals showcasing the latest advancements in the field. Some interesting recent trends:

- Multimodal models and datasets
  - Large Language Models (LLMs) are being used to train vision models
  - Images are used to ground LLMs, reducing their hallucinations
  - Models are fed both images and videos to achieve better results
- Foundation models are a commodity
  - These models are becoming more accessible and less expensive to create
  - They are trained on multiple modalities and tasks (even for very niche tasks like hand pose estimation)
- Transformers are everywhere: while not a new trend, it’s still notable that attention mechanisms are incorporated into almost every model....

June 29, 2024 · 8 min · 1572 words

[Lecture notes] Let's build the GPT Tokenizer

Andrej Karpathy has released a great series of in-depth, hands-on videos on building GPT models. Here are my notes from watching the “Let’s build the GPT Tokenizer” video. What are Tokens? Large Language Models (LLMs) don’t process raw text directly. They operate on tokens, the output of the tokenization process that translates text into a sequence of tokens. Many issues with LLMs are mainly due to tokenization: LLMs are bad at simple arithmetic....
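
To make the idea concrete, here is a minimal sketch of one BPE training step in the spirit of the lecture: count adjacent pairs of token IDs, then merge the most frequent pair into a newly allocated ID (the example string is mine).

```python
# One BPE training step, in the spirit of the lecture: count adjacent pairs of
# token IDs, then merge the most frequent pair into a newly allocated ID.
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)       # replace the pair with the new token
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # start from raw UTF-8 bytes
stats = get_stats(ids)
top_pair = max(stats, key=stats.get)        # most frequent adjacent pair
ids = merge(ids, top_pair, 256)             # 256 = first ID beyond the byte range
print(top_pair, ids)
```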

June 8, 2024 · 5 min · 926 words

[Summary] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

TL;DR Generative Large Language Models (LLMs) can only generate text based on their training data, which means any extension to additional sources necessitates additional training. Retrieval-Augmented Generation (RAG) combines an external database with an LLM, which enables updating the LLM’s knowledge and makes it more precise for specific applications. Method Building blocks The method consists of three building blocks. Document index: a pre-trained model was used to encode the documents into embeddings and build the index....
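
A bare-bones sketch of the retrieval step, with a hypothetical `embed` function standing in for the paper’s DPR-style encoders (a real system would also use an approximate nearest-neighbor index such as FAISS):

```python
# Bare-bones retrieval step: documents are pre-encoded into a dense index, the
# query is encoded the same way, and the top-k documents by inner product are
# handed to the generator as extra context. `embed` is a hypothetical
# sentence encoder; the paper uses DPR encoders with a FAISS index.
import numpy as np

def build_index(documents, embed):
    return np.stack([embed(d) for d in documents])   # (N, d) matrix of document embeddings

def retrieve(query, documents, index, embed, k=5):
    scores = index @ embed(query)                    # inner-product relevance scores
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]
```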

April 29, 2024 · 2 min · 335 words

[Summary] Object Recognition as Next Token Prediction

TL;DR Models for object classification require a fixed set of pre-defined classes, which prevents them from recognizing arbitrary objects. In this paper, a visual classifier is trained to predict the most likely token of a pre-trained Large Language Model (LLM). Given that LLMs are trained on extensive textual data, training a model to predict across the entire token space allows it to capture the full range of textual information. Methods The model is trained to predict a probability for each token of a pretrained LLM. Denote Xv as the visual features and W as the LLM token embeddings; the model predicts w, the most probable single token, over the full token space. To guide the language decoder, the authors prompt it with “the objects in the image are” (Xp)....
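
A plausible form of that prediction, assuming the standard softmax readout over the LLM’s token-embedding matrix (my notation, not copied from the paper):

```latex
% Assumed readout: the decoder hidden state h, conditioned on the visual
% features X_v and the prompt X_p, is scored against the token embeddings W.
P(w \mid X_v, X_p) = \mathrm{softmax}\big(W\, h(X_v, X_p)\big)_{w}
```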

April 23, 2024 · 2 min · 267 words