From DETR to RF-DETR: The Evolution of End-to-End Object Detection

TL;DR: Object detection has shifted from heavy, hand-engineered pipelines based on anchors and heuristics to end-to-end transformer architectures that learn object localization and classification jointly. This progression (from DETR in 2020 to RF-DETR in 2025) has reduced post-processing, improved training stability, and brought real-time inference within reach.
DETR: End-to-End Object Detection with Transformers (2020)
DEtection TRansformer (DETR) introduced a simple yet novel idea: formulate object detection as a direct set prediction problem solved with transformers....

October 31, 2025 · 4 min · 756 words

[Summary] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

TL;DR: Grounding DINO is an open-set object detector that integrates natural language supervision into the DETR-style DINO framework. Instead of being limited to a fixed set of classes, it accepts text prompts (e.g., zebra, traffic light) and finds those objects in images at inference time. The model achieves this by coupling image and text representations throughout its architecture using cross-modality attention and language-conditioned query mechanisms.
Motivation
Closed‑set detectors are limited to a fixed label list and cannot recognize unseen categories without new annotations and retraining....

October 13, 2025 · 3 min · 584 words