[Summary] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
TL;DR

Grounding DINO is an open-set object detector that integrates natural language supervision into the DETR-style DINO framework. Instead of being limited to a fixed set of classes, it accepts text prompts (e.g., zebra, traffic light) and finds the corresponding objects in images at inference time. The model achieves this by coupling image and text representations throughout its architecture using cross-modality attention and language-conditioned query mechanisms.

Motivation

Closed-set detectors are limited to a fixed label list and cannot recognize unseen categories without new annotations and retraining....
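The contrast between a fixed label list and prompt-specified categories can be illustrated with a toy sketch. It is not the paper's implementation: the feature vectors are hand-made, the function name `score_queries_against_prompts` and the 0.3 threshold are hypothetical, and cosine similarity is assumed here as a simple stand-in for however the real model compares query and text features. The point is only that the "label set" is an input at inference time rather than a fixed output layer.

```python
import numpy as np

def score_queries_against_prompts(query_feats, text_feats, threshold=0.3):
    """Match each object query to its most similar prompt.

    Returns (query_index, prompt_index, score) for every query whose best
    similarity exceeds `threshold`. Illustrative only; names are hypothetical.
    """
    # L2-normalise rows so the dot product below is a cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = q @ t.T                      # shape: (num_queries, num_prompts)
    best = sim.argmax(axis=1)          # best-matching prompt per query
    best_score = sim[np.arange(len(q)), best]
    keep = np.flatnonzero(best_score > threshold)
    return [(int(i), int(best[i]), float(best_score[i])) for i in keep]

# Prompts are chosen at inference time; no retraining is implied.
prompts = ["zebra", "traffic light"]

# Hand-made stand-ins for text and query embeddings (assumed values).
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
query_feats = np.array([[0.9, 0.1, 0.0],   # resembles "zebra"
                        [0.1, 0.8, 0.1],   # resembles "traffic light"
                        [0.0, 0.0, 1.0]])  # matches neither prompt

matches = score_queries_against_prompts(query_feats, text_feats)
for query_idx, prompt_idx, score in matches:
    print(f"query {query_idx} -> {prompts[prompt_idx]!r} (score {score:.2f})")
```

Changing the `prompts` list (and the corresponding text features) changes which objects are reported, which is the behaviour a closed-set detector with a fixed classification head cannot offer without retraining.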