From DETR to RF-DETR: The Evolution of End-to-End Object Detection

TL;DR Object detection has shifted from heavy, hand-engineered pipelines built on anchors and heuristics to end-to-end transformer architectures that learn object localization and classification jointly. This progression (from DETR in 2020 to RF-DETR in 2025) has reduced post-processing, improved training stability, and brought real-time inference within reach.

DETR: End-to-End Object Detection with Transformers (2020)

DEtection TRansformer (DETR) introduced a simple yet novel idea: formulate object detection as a direct set prediction problem solved with transformers....

October 31, 2025 · 4 min · 756 words
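The excerpt above frames DETR as direct set prediction: a fixed set of predictions is matched one-to-one to the ground-truth objects, which is what removes anchors and non-maximum suppression. Below is a minimal sketch (not from the post) of that bipartite matching step using the Hungarian algorithm; the `hungarian_match` helper and the L1-only cost are simplifications for illustration, whereas DETR's actual matching cost also includes classification and GIoU terms.

```python
# Minimal sketch of DETR-style set matching (illustration only, not the post's code).
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """One-to-one assignment of predictions to ground-truth boxes.

    pred_boxes: (N, 4) predicted boxes as (cx, cy, w, h), N = query budget
    gt_boxes:   (M, 4) ground-truth boxes, M <= N
    Returns (pred_idx, gt_idx) minimizing the total matching cost.
    """
    # Simplified cost: pairwise L1 distance between boxes
    # (DETR additionally uses class-probability and GIoU terms).
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(axis=-1)
    return linear_sum_assignment(cost)

# Toy usage: 5 predictions from a fixed query budget, 2 ground-truth objects.
preds = np.random.rand(5, 4)
gts = np.array([[0.5, 0.5, 0.2, 0.3], [0.1, 0.8, 0.1, 0.1]])
pred_idx, gt_idx = hungarian_match(preds, gts)
print(pred_idx, gt_idx)  # unmatched predictions would be trained toward "no object"
```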

[Summary] Training Vision Transformers with Only 2040 Images

TL;DR Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs) when given sufficient data, but they are data-hungry, which limits their use on small datasets. The authors propose a method to train ViTs with limited data: pre-training with label smoothing, lower-resolution images, and parametric instance discrimination, followed by fine-tuning on the target task.

Method

Training a Vision Transformer on small datasets involves two steps. Self-supervised pre-training uses parametric instance discrimination: classify each image as its own class....

February 15, 2025 · 2 min · 217 words
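The summary above describes parametric instance discrimination, where every training image is treated as its own class. Below is a minimal PyTorch sketch of that pre-training objective (an illustration, not the paper's code); the toy `backbone`, feature size, image size, and label-smoothing value are assumptions made for the example.

```python
# Sketch of parametric instance discrimination for small-data pre-training.
# Assumptions: PyTorch; a toy stand-in backbone instead of a real ViT.
import torch
import torch.nn as nn

num_images = 2040      # one "class" per training image, matching the paper's setting
feature_dim = 384      # assumed embedding size
image_size = 32        # assumed (lower-resolution) input size

backbone = nn.Sequential(           # stand-in for a ViT feature extractor
    nn.Flatten(),
    nn.Linear(3 * image_size * image_size, feature_dim),
    nn.ReLU(),
)
instance_head = nn.Linear(feature_dim, num_images)  # parametric classifier over instances

# Label smoothing (mentioned in the summary) softens the one-hot instance targets.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

images = torch.randn(8, 3, image_size, image_size)   # toy batch
instance_ids = torch.randint(0, num_images, (8,))    # each image's own index as its label

logits = instance_head(backbone(images))
loss = criterion(logits, instance_ids)
loss.backward()
print(float(loss))
```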