How Transformers Learn Order: Absolute, Relative, and Rotary Positions

Transformers process tokens in parallel and have no built‑in sense of order. Positional encodings inject information about where each token appears in the sequence. The basic idea: each position gets a vector, and that vector is added to the token embedding before entering the transformer. This lets the model distinguish sequences like “dog bites man” from “man bites dog” and learn how order affects meaning. Over time, several approaches to positional encoding have emerged, ranging from fixed sinusoidal schemes to fully learned embeddings, relative encodings, and rotary methods, each with different tradeoffs in flexibility, inductive bias, and length generalization....
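
As a concrete illustration of that idea, here is a minimal NumPy sketch (mine, not from the post) of the fixed sinusoidal scheme; it assumes an even d_model:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal position encodings; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

# The vector for each position is added to the token embedding:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```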

November 15, 2025 · 5 min · 1063 words

From DETR to RF-DETR: The Evolution of End-to-End Object Detection

TL;DR Object detection has shifted from heavy, hand-engineered pipelines based on anchors and heuristics to end-to-end transformer architectures that learn object localization and classification jointly. This progression (from DETR in 2020 to RF-DETR in 2025) has reduced post-processing, improved training stability, and brought real-time inference within reach. DETR: End-to-End Object Detection with Transformers (2020) DEtection TRansformer (DETR) introduced a simple yet novel idea: formulate object detection as a direct set prediction problem solved with transformers....
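
For intuition about "direct set prediction," here is a toy sketch (mine) of the bipartite matching step DETR uses to pair predictions with ground-truth objects; the real cost matrix mixes classification and box terms, whereas this one is random:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((5, 3))   # 5 predicted boxes vs. 3 ground-truth objects

# Hungarian matching: each ground-truth object is assigned to exactly
# one prediction; the remaining predictions learn to output "no object".
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))
```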

October 31, 2025 · 4 min · 756 words

[Summary] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

TL;DR Grounding DINO is an open-set object detector that integrates natural language supervision into the DETR-style DINO framework: instead of being limited to a fixed set of classes, it allows specifying text prompts (e.g., zebra, traffic light) and finding those objects within images at inference time. The model achieves this by coupling image and text representations throughout its architecture using cross-modality attention and language-conditioned query mechanisms. Motivation Closed‑set detectors are limited to a fixed label list and cannot recognize unseen categories without new annotations and retraining....

October 13, 2025 · 3 min · 584 words

[Summary] DINOv3: Self-Supervised Vision Transformers at Scale

TL;DR The DINO series advances self-supervised learning for vision transformers through iterative architectural and data refinements. DINOv1 introduces student-teacher distillation on ImageNet-1k. DINOv2 scales to 142M curated images with patch-level objectives. DINOv3 reaches 1.7B Instagram images, adding register tokens, a new Gram-matrix-based loss, and a custom 7B-parameter ViT, achieving state-of-the-art performance on dense prediction tasks (like instance segmentation) while keeping the backbone frozen. Motivation Supervised pretraining on ImageNet has dominated vision models, but manually annotating large datasets is expensive and constrains representation quality to label granularity....
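
As a reference point for the student-teacher distillation mentioned above, here is a minimal sketch of the DINOv1-style objective (my code; the temperatures and centering follow the original recipe, but the names are illustrative):

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened teacher and student distributions."""
    teacher = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    log_student = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher * log_student).sum(dim=-1).mean()

# The teacher's weights are an exponential moving average of the student's;
# gradients flow only through the student (the teacher output is detached).
```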

October 11, 2025 · 5 min · 871 words

[Summary] Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

TL;DR Many machine learning systems deal with multimodal data, yet there has been no systematic study of design choices across modalities. The paper surveys common “tricks” for multimodal systems and finds the most effective techniques to be: (i) basic strategies such as gradient clipping and learning-rate warmup, (ii) late fusion using pretrained unimodal encoders, (iii) auxiliary cross-modal alignment objectives, (iv) simple input-level augmentation, and (v) modality dropout and learnable embeddings for handling missing inputs (sketched below)....
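
A sketch of trick (v), modality dropout with learnable placeholder embeddings; the class and argument names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """Randomly drop a whole modality during training and substitute a
    learnable placeholder, so the model tolerates missing inputs."""
    def __init__(self, dims: dict, p: float = 0.2):
        super().__init__()
        self.p = p
        self.placeholders = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d)) for m, d in dims.items()})

    def forward(self, feats: dict) -> dict:
        if not self.training:
            return feats
        return {m: self.placeholders[m].expand_as(x)
                if torch.rand(()) < self.p else x
                for m, x in feats.items()}

layer = ModalityDropout({"image": 512, "text": 768})
layer.train()
out = layer({"image": torch.randn(4, 512), "text": torch.randn(4, 768)})
```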

September 9, 2025 · 3 min · 626 words

[Concept] Inside Transformer Attention

Attention Layer Attention blocks are the backbone of the Transformer architecture, enabling the model to capture dependencies across the input sequence. An attention layer takes as input: a query vector \(q \in \mathbb{R}^d\), a matrix of keys \(K \in \mathbb{R}^{n \times d}\) (rows are \(k_i^\top\)), and a matrix of values \(V \in \mathbb{R}^{n \times d_v}\). In the vanilla Transformer setup, the query, key, and value come from the same token embedding \(x\), but the model is free to learn different subspaces for “asking” (queries), “addressing” (keys), and “answering” (values):...
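
In code, the standard scaled dot-product form of that layer looks like this (a NumPy sketch for a single query, following the notation above):

```python
import numpy as np

def attention(q, K, V):
    """q: (d,), K: (n, d), V: (n, d_v) -> output: (d_v,)"""
    scores = K @ q / np.sqrt(q.shape[-1])    # similarity of q to each key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # convex combination of values
```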

August 22, 2025 · 2 min · 418 words

[Summary] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

TL;DR Machine learning model evaluations commonly report only the “highest number,” often without any measure of statistical uncertainty. This creates misleading comparisons, especially when differences between models are small. The paper surveys methods for adding statistical error bars to evals, covering independent and clustered questions, paired model comparisons, and power analysis. These tools help quantify uncertainty and avoid overconfident claims about which model is better. Motivation LLM evals often treat the top score as definitive....
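
For the simplest case the paper covers (independent questions), the error bar is just the standard error of the mean; a minimal sketch:

```python
import numpy as np

def accuracy_ci(correct, z: float = 1.96):
    """95% CI for mean accuracy over independent eval questions."""
    correct = np.asarray(correct, dtype=float)  # 1.0 = right, 0.0 = wrong
    acc = correct.mean()
    sem = correct.std(ddof=1) / np.sqrt(len(correct))
    return acc, (acc - z * sem, acc + z * sem)

# Example: 1000 questions, ~87% answered correctly.
rng = np.random.default_rng(0)
print(accuracy_ci(rng.random(1000) < 0.87))
```

Clustered questions and paired comparisons refine this estimate but keep the same basic shape.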

August 20, 2025 · 5 min · 907 words

[Summary] MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

TL;DR Recent advancements in vector retrieval demonstrate that mapping queries and documents to multiple vectors and performing multi-vector retrieval surpasses common single-vector retrieval methods. However, multi-vector retrieval is computationally intensive because its similarity measure (Chamfer similarity) is not a single inner product. MUVERA is a method designed to accelerate multi-vector retrieval: it converts sets of vectors into single Fixed Dimensional Encodings (FDEs). The inner product of two FDEs approximates the Chamfer similarity score, allowing standard, highly optimized Maximum Inner Product Search (MIPS) solvers to generate a candidate list, followed by exact similarity calculation on this small set of candidates for precise ranking....
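
For reference, Chamfer similarity between two vector sets is simple to state but is not a single inner product, which is exactly what makes it awkward for MIPS solvers (sketch below, my code):

```python
import numpy as np

def chamfer_similarity(Q: np.ndarray, D: np.ndarray) -> float:
    """Q: (m, d) query vectors, D: (n, d) document vectors.
    Each query vector is credited with its best-matching document vector."""
    sims = Q @ D.T                 # (m, n) pairwise inner products
    return sims.max(axis=1).sum()  # max over documents, summed over queries
```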

August 11, 2025 · 3 min · 543 words

[Summary] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective

TL;DR Popular methods for chain‑of‑thought (CoT) reasoning (e.g., supervised fine‑tuning, Tree‑of‑Thoughts) face three challenges: (i) distribution drift, where small mistakes spiral with no recovery mechanism, (ii) missing search structure, with no built-in exploration or backtracking, and (iii) explosive computational cost. The proposed Diligent Learner models reasoning as depth-first search guided by a validator: it builds reasoning paths step by step, checks each step for correctness, and backtracks when needed....
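
A schematic of that validator-guided depth-first search (purely illustrative; the function names are mine, not the paper's API):

```python
def search(state, expand, is_valid, is_goal, depth=0, max_depth=8):
    """Extend the reasoning path one step at a time, validate each step,
    and backtrack on failure."""
    if is_goal(state):
        return [state]
    if depth >= max_depth:
        return None                  # depth cap bounds the compute
    for step in expand(state):       # candidate next reasoning steps
        if not is_valid(step):       # validator rejects bad steps early
            continue
        path = search(step, expand, is_valid, is_goal, depth + 1, max_depth)
        if path is not None:
            return [state] + path    # valid path found; unwind
    return None                      # dead end -> backtrack
```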

August 7, 2025 · 5 min · 875 words

[Summary] Ada-R1: Hybrid CoT via Bi-Level Adaptive Reasoning Optimization

TL;DR Chain-of-Thought (CoT) enables large language models (LLMs) to solve complex tasks by generating intermediate reasoning steps. The Ada-R1 approach fine-tunes a model to prefer Short-CoT over Long-CoT based on problem complexity, training it to minimize reasoning length while preserving accuracy. This reduces average reasoning length by over 50%, substantially lowering inference cost while maintaining accuracy across five mathematical reasoning benchmarks. Background CoT prompting decomposes complex tasks into intermediate reasoning steps....

May 1, 2025 · 2 min · 372 words

[Summary] LettuceDetect: A Hallucination Detection Framework for RAG Applications

TL;DR Retrieval-Augmented Generation (RAG) grounds large-language-model (LLM) answers in external documents, yet hallucinations persist. Existing detectors either rely on expensive LLM-as-a-judge setups or on encoder classifiers that truncate context and lose evidence. LettuceDetect introduces a long-context, token-level classifier built on ModernBERT. It surpasses prior encoder baselines while remaining markedly more efficient than LLM-based judges. Background LLMs hallucinate when generated claims are not supported by retrieved context. Encoder detectors shorten inputs to fit model context limits, reducing recall, whereas generative judges process full context but incur high latency and cost....
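
The general shape of token-level detection with a long-context encoder looks like the sketch below; the checkpoint ID is a placeholder (substitute the released LettuceDetect weights), and the label semantics depend on the actual model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

ckpt = "org/lettucedetect-modernbert"  # placeholder ID, not a real checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt)

context = "The Eiffel Tower is 330 metres tall."
answer = "The tower is 500 metres tall."
inputs = tok(context, answer, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels)
labels = logits.argmax(-1)[0]           # per-token supported vs. hallucinated
```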

April 25, 2025 · 2 min · 220 words

[Summary] On the Biology of a Large Language Model

TL;DR Large Language Models (LLMs) are often perceived as “black boxes,” making their decision-making and reasoning processes difficult to interpret. A novel method simplifies these complex models by replacing internal nonlinear layers with linear modules tailored to clearly understandable features. This approach reveals structured reasoning, planning behaviors, and even hidden intentions within the model’s computations. Method Interpreting LLMs is challenging because individual neurons often represent multiple, unrelated concepts simultaneously (polysemanticity). To address this, the approach creates a simplified “replacement model”, preserving most of the original model’s performance while enhancing interpretability through these steps:...

April 12, 2025 · 2 min · 367 words

[Summary] VGGT: Visual Geometry Grounded Transformer

TL;DR Traditional 3D reconstruction relied on iterative visual-geometry optimization (e.g., Bundle Adjustment). Recent work explored integrating machine learning via differentiable Bundle Adjustment, but remained slow and limited. VGGT (Visual Geometry Grounded Transformer) is a large feed-forward transformer that predicts all key 3D scene attributes—camera parameters, depth maps, point maps, and 3D point tracks—directly from one or many images in a single forward pass. It removes the need for geometry processing, achieves state-of-the-art results in multiple benchmarks, and runs in under a second....

April 5, 2025 · 3 min · 479 words

[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR Transformer models often map multiple concepts to the same neuron, making it unclear what features they learn. This work makes inner representations interpretable by using a sparse autoencoder layer to map neurons to concepts. This method extracts relatively monosemantic concepts, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features. Method A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....
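
The core of the method fits in a few lines; a minimal PyTorch sketch of a sparse autoencoder over model activations (sizes illustrative, echoing the 512-neuron setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_in-dim activations into an overcomplete feature basis."""
    def __init__(self, d_in: int = 512, d_feat: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_in, d_feat)
        self.dec = nn.Linear(d_feat, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse, non-negative feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
x = torch.randn(8, 512)              # a batch of MLP activations
recon, feats = sae(x)
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().mean()  # L2 + L1 sparsity
```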

March 15, 2025 · 2 min · 333 words

[Summary] Relightable Gaussian Codec Avatars

TL;DR Photorealistic head avatars are a key technology for virtual and augmented reality. However, current approaches either lack the fidelity to capture fine details (like hair) or are too slow for real-time use. Relightable Gaussian Codec Avatars proposes using (i) 3D Gaussian splatting for efficient geometry representation and (ii) a learnable radiance transfer model for appearance, including an explicit eye model. Background Image relighting is the task of showing what a scene from a source image would look like if illuminated differently....

February 28, 2025 · 3 min · 606 words

[Summary] Training Vision Transformers with Only 2040 Images

TL;DR Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) with sufficient data but are data-hungry, limiting their use with small datasets. The authors propose a method to train ViTs with limited data by pre-training with label smoothing, lower-resolution images, and parametric instance discrimination, followed by fine-tuning on the target task. Method Training a Vision Transformer on small datasets involves two steps. The first is self-supervised pretraining with parametric instance discrimination: classify each image as its own class (sketched below)....
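
Parametric instance discrimination is lighter than it sounds: with N training images, attach an N-way classifier head and treat every image as its own class (illustrative sketch, my names):

```python
import torch
import torch.nn as nn

N, d = 2040, 384                     # dataset size, embedding dim (illustrative)
head = nn.Linear(d, N)               # one "class" per training image

features = torch.randn(16, d)        # backbone outputs for a batch of 16 images
image_ids = torch.randint(0, N, (16,))  # each image's own index is its label
loss = nn.functional.cross_entropy(head(features), image_ids)
```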

February 15, 2025 · 2 min · 217 words

[Summary] ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance worsens due to dimensional collapse of representations: representations come to lie in a narrow cone in feature space. The authors analyze the contrastive loss and extract the term that prevents this collapse. By taking a gradient descent step on that term with respect to the features, they derive the ContraNorm layer, which leads to a more uniform distribution and prevents dimensional collapse....
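
From my reading, the basic variant of the layer amounts to subtracting a step along the similarity-weighted features before normalizing (a hedged sketch; the paper discusses several variants):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContraNorm(nn.Module):
    """Subtract a step along the collapse-inducing direction, then normalize."""
    def __init__(self, dim: int, scale: float = 0.1):
        super().__init__()
        self.scale = scale
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (batch, tokens, dim)
        sim = F.softmax(x @ x.transpose(-2, -1), dim=-1)   # token-token similarity
        return self.norm(x - self.scale * sim @ x)
```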

February 1, 2025 · 2 min · 400 words

[Summary] ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR Large Language Models (LLMs) often suffer from hallucinations. Two common mitigation strategies are Chain of Thought (CoT), where the LLM is prompted to show its step-by-step reasoning, and Act, where LLMs use external tools to ground their answers in reliable databases. However, CoT relies on the model’s internal representations, limiting its ability to reason reactively or update its knowledge. ReAct is a prompting method that combines CoT with action plan generation using external tools....
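
A minimal ReAct-style loop, to make the Thought → Action → Observation cycle concrete (the prompt format and helper names are illustrative, not the paper's exact prompts):

```python
import re

def parse_action(text: str):
    """Extract 'Action: Tool[argument]' from a model response."""
    m = re.search(r"Action:\s*(\w+)\[(.*?)\]", text)
    return (m.group(1), m.group(2)) if m else (None, None)

def react_loop(llm, tools, question, max_steps=5):
    """Alternate reasoning and tool use until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # model emits Thought + Action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        tool, arg = parse_action(step)
        if tool in tools:                      # ground the next thought in a tool call
            transcript += f"Observation: {tools[tool](arg)}\n"
    return None
```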

January 17, 2025 · 1 min · 203 words

[Summary] Unifying Generative and Dense Retrieval for Sequential Recommendation

TL;DR Traditional item retrieval methods use user and item embeddings to predict relevance via inner-product computation, which is not scalable for large systems. Generative models predict item indices directly but struggle with new items. This work proposes a hybrid model that combines item positions, text representations, and semantic IDs to predict both the next-item embedding and several likely next-item IDs. Only this candidate subset, along with new items, is then scored by inner product against the user representation....
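
The scoring protocol is easy to picture: the generative head proposes a few candidate IDs, and the inner product runs only over that subset plus new items (illustrative sketch, my names):

```python
import numpy as np

rng = np.random.default_rng(0)
item_embs = rng.standard_normal((100_000, 64))  # full catalog embeddings
user_emb = rng.standard_normal(64)              # from the sequential model

generated_ids = [12, 874, 3021]      # IDs decoded by the generative head
new_item_ids = [99_998, 99_999]      # cold-start items the generator can't emit
candidates = generated_ids + new_item_ids

scores = item_embs[candidates] @ user_emb       # inner product over the subset only
ranked = [candidates[i] for i in np.argsort(-scores)]
```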

January 4, 2025 · 2 min · 367 words

[Summary] The Evolution of Multimodal Model Architectures

TL;DR Multimodal models are advancing rapidly across research and industry. Their architectures can be grouped into four types. Types A and B integrate multimodal data within the internal layers of the model: Type A relies on standard cross-attention for fusion, while Type B introduces custom-designed layers for multimodal fusion. Types C and D fuse modalities at the input stage (early fusion): Type C uses modality-specific encoders without tokenization, while Type D employs tokenizers for each modality at the input and can generate outputs in multiple modalities (any-to-any multimodal models). Model Architecture Overview Models processing images, audio, or video alongside text have evolved significantly....

November 1, 2024 · 3 min · 427 words