Welcome 👋

I’m Koby, a machine learning engineer. My PHD centered around computer vision and information theory algorithms.
In my free time, I enjoy my guilty pleasures: coffee and carbs.
This website serves as a repository for my learning notes, paper summaries, and proof of concepts.

[Summary] Perception Encoder: The best visual embeddings are not at the output of the network

TL;DR Current beliefs suggest that a single vision model cannot achieve SOTA performance across language-centric and spatial tasks. Perception Encoder (PE) is a vision encoder demonstrates that contrastive vision-language pretraining provides versatile features suitable for both multimodal language modeling and dense spatial prediction. These diverse capabilities reside within intermediate layers rather than the model output. The authors fine-tune two PE variants to migrate these hidden representations to the final layer: one optimized for language tasks and another for spatial tasks....

[Summary] Why Less is More (Sometimes): A Theory of Data Curation

TL;DR Training LLMs often requiring hundreds of billions of tokens, yet not all data points contribute equally to learning: while some accelerate progress, others are redundant or even detrimental. The paper builds a theory for when a smaller, curated dataset can outperform using all available data in high dimensional learning. It models a label generator, and pruning oracle, then derives test error scaling laws showing that “less is more” only in a specific regime where data is abundant and the label generator is strong, while “more is more” remains optimal in most other regimes....

How Transformers Learn Order: Absolute, Relative, and Rotary Positions

Transformers process tokens in parallel and have no built‑in sense of order. Positional encodings inject information about where each token appears in the sequence. The basic idea: each position gets a vector, and that vector is added to the token embedding before entering the transformer. This lets the model distinguish sequences like “dog bites man” from “man bites dog” and learn how order affects meaning. Over time, several approaches to positional encoding have emerged, ranging from fixed sinusoidal schemes to fully learned embeddings, relative encodings, and rotary methods, each with different tradeoffs in flexibility, inductive bias, and length generalization....

From DETR to RF-DETR: The Evolution of End-to-End Object Detection

TL;DR Object detection has shifted from heavy, hand-engineered pipelines based on anchors and heuristics to end to end transformer architectures that learn object localization and classification jointly. This progression (from DETR 2020 to RF-DETR 2025) has reduced post-processing, improved training stability, and brought real-time inference within reach. DETR: End-to-End Object Detection with Transformers (2020) DEtection TRansformer (DETR) introduced a simple yet novel idea: formulate object detection as a direct set prediction problem solved with transformers....

[Summary] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

TL;DR Grounding DINO is an open-set object detector that integrates natural language supervision into the DETR-style DINO framework: Instead of being limited to a fixed set of classes, it allows specifying text prompts (e.g., zebra, traffic light) and find those objects within images at inference time. The model achieves this by coupling image and text representations throughout its architecture using cross-modality attention and language-conditioned query mechanisms. Motivation Closed‑set detectors are limited to a fixed label list and cannot recognize unseen categories without new annotations and retraining....

[Summary] DINOv3: Self-Supervised Vision Transformers at Scale

TLDR The DINO series advances self-supervised learning for vision transformers through iterative architectural and data refinements. DINOv1 introduces student-teacher distillation on ImageNet-1k. DINOv2 scales to 142M curated images with patch-level objectives. DINOv3 reaches 1.7B Instagram images with register tokens, new Gram matrix based loss, with a custom 7B-parameter ViT, achieving state-of-the-art performance on dense prediction tasks (like instance segmentation) while maintaining a frozen backbone. Motivation Supervised pretraining on ImageNet has dominated vision models, but manually annotating large datasets is expensive and constrains representation quality to label granularity....

[Summary] Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

TL;DR Many machine learning systems deal with multimodal data, however there’s no study examining different design choices across modalities. The paper surveys common “tricks” for multimodal systems and found the most effective techniques are: (i) basic strategies such as gradient clipping and learning-rate warmup, (ii) late fusion using pretrained unimodal encoders, (iii) auxiliary cross-modal alignment objectives, (iv) simple input-level augmentation, and (v) modality dropout and learnable embeddings for handling missing inputs....

[Concept] Inside Transformer Attention

Attention Layer Attention blocks are the backbone of the Transformer architecture, enabling the model to capture dependencies across the input sequence. An attention layer takes as input: A query vector \(q \in \mathbb{R}^d\) A matrix of keys \(K \in \mathbb{R}^{n \times d}\) (rows are \(k_i^\top\)) A matrix of values \(V \in \mathbb{R}^{n \times d_v}\) In the vanilla Transformer setup, the query, key, and value come from the same token embedding \(x\) but the model is free to learn different subspaces for “asking” (queries), “addressing” (keys), and “answering” (values):...

[Summary] Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations

TL;DR Machine learning model evaluation commonly reports the “highest number” often lacking any kind of statistical significance. This creates misleading comparisons, especially when differences between models are small. The paper overviews the different methods of adding statistical error bars to evals, covering independent and clustered questions, paired model comparisons, and power analysis. These tools help quantify uncertainty and avoid overconfident claims about which model is better. Motivation LLM evals often treat the top score as definitive....

[Summary] MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

TL;DR Recent advancements in vector retrieval demonstrate that mapping queries and documents to multiple vectors and performing multi-vector retrieval surpasses common single-vector retrieval methods. However, multi-vector retrieval is computationally intensive as it does not use linear operations like the inner product. MUVERA is a method designed to accelerate multi-vector retrieval: It converts sets of vectors into single Fixed Dimensional Encodings (FDEs). The inner product of two FDEs approximates the Chamfer Similarity score, allowing standard, highly-optimized Maximum Inner Product Search (MIPS) solvers to generate a candidate list, following by exact similarity calculation on this small set of candidates for precise ranking....

[Summary] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective

TL;DR Popular methods for chain‑of‑thought (CoT) reasoning (e.g supervised fine‑tuning, Tree‑of‑Thoughts) have three challenges: (i) distribution drift where small mistakes spiral with no recovery mechanism, (ii) missing search structure such that there’s no built-in exploration or backtracking, (iii) and explosive computational cost. The proposed Diligent Learner models reasoning as depth-first search guided by a validator. It is trained by building reasoning paths step-by-step, checks each one for correctness, and backtracks when needed....

[Summary] Ada-R1: Hybrid CoT via Bi-Level Adaptive Reasoning Optimization

TL;DR Chain-of-Thought (CoT) enables large language models (LLMs) to solve complex tasks by generating intermediate reasoning steps. Ada-R1 approach fine-tunes a model to prefer Short-CoT over Long-CoT based on problem complexity, using training a model to minimize reasoning length while preserving accuracy. This approach reduces average reasoning length by over 50%, substantially lowering inference cost, with maintained accuracy across five mathematical reasoning benchmarks. Background CoT prompting decomposes complex tasks into intermediate reasoning steps....

[Summary] LettuceDetect: A Hallucination Detection Framework for RAG Applications

TL;DR Retrieval-Augmented Generation (RAG) grounds large-language-model (LLM) answers in external documents, yet hallucinations persist. Existing detectors either rely on expensive LLM as a judges or on encoder classifiers that truncate context and lose evidence. LettuceDetect introduces a long-context, token-level classifier built on ModernBERT. It surpasses prior encoder baselines while remaining markedly more efficient than LLM-based judges. Background LLMs hallucinate when generated claims are not supported by retrieved context. Encoder detectors shorten inputs to fit model limits (context size), reducing recall, whereas generative judges process full context but incur high latency and cost....

[Summary] On the Biology of a Large Language Model

TL;DR Large Language Models (LLMs) are often perceived as “black boxes,” making their decision-making and reasoning processes difficult to interpret. A novel method simplifies these complex models by replacing internal nonlinear layers with linear modules tailored to clearly understandable features. This approach reveals structured reasoning, planning behaviors, and even hidden intentions within the model’s computations. Method Interpreting LLMs is challenging because individual neurons often represent multiple, unrelated concepts simultaneously (polysemanticity). To address this, the approach creates a simplified “replacement model”, preserving most of the original model’s performance while enhancing interpretability through these steps:...

[Summary] VGGT: Visual Geometry Grounded Transformer

TL;DR Traditional 3D reconstruction relied on iterative visual-geometry optimization (e.g., Bundle Adjustment). Recent work explored integrating machine learning via differentiable Bundle Adjustment, but remained slow and limited. VGGT (Visual Geometry Grounded Transformer) is a large feed-forward transformer that predicts all key 3D scene attributes—camera parameters, depth maps, point maps, and 3D point tracks—directly from one or many images in a single forward pass. It removes the need for geometry processing, achieves state-of-the-art results in multiple benchmarks, and runs in under a second....

[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR Transformer models often map multiple concepts to the same neuron, making it unclear what features they learn. This work makes inner representations interpretable by using a sparse autoencoder layer to map neurons to concepts. This method extracts relatively monosemantic concepts, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features. Method A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....

[Summary] Relightable Gaussian Codec Avatars

TL;DR Photorealistic head avatars is a required technology for virtual and augmented reality. However, current approaches either lack the fidelity to capture fine details (like hair) or are too slow for real-time use. Relightable Gaussian Codec Avatars proposes using (i) 3D Gaussian splatting for efficient geometry representation and (ii) a learnable radiance transfer model for appearance, including an explicit eye model. Background Image relighting is the task of showing what a scene from a source image would look like if illuminated differently....

[Summary] Training Vision Transformers with Only 2040 Images

TL;DR Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) with sufficient data but are data-hungry, limiting their use with small datasets. The authors propose a method to train ViTs with limited data by pre-training with label smoothing, lower resolution images, and parametric instance discrimination, followed by fine-tuning on the target task. Method Training a Vision Transformer on small datasets involves two steps Self-supervised pretraining: Parametric instance discrimination: Classify each image as its own class....

[Summary] ContraNorm: A Contrastive Learning Per-spective on Oversmoothing and beyond

TL;DR Oversmoothing is a common phenomenon in Transformers, where performance worsens due dimensional collapse of representations, where representations lie in a narrow cone in the feature space. The authors analyze the Contrastive lose and extracted the term that prevent this collapse. By taking a gradient descent step with respect to the feature later, they come up with the ContraNorm layer that leads to a more uniform distribution and prevents the dimensional collapse....

[Summary] ReAct: Synergizing Reasoning and Acting in Language Models

TL;DR Large Language Models (LLMs) often suffer from hallucinations. Two common mitigation strategies are Chain of Thought (CoT), where the LLM is prompted to show its step-by-step reasoning, and Act, where LLMs use external tools to ground their answers in reliable databases. However, CoT relies on the model’s internal representations, limiting its ability to reason reactively or update its knowledge. ReAct is a prompting method that combines CoT with action plan generation using external tools....