Unified vision model for open-world object detection and understanding. Supports text, visual, and customized prompts, as well as prompt-free universal object detection. Trained on the Grounding-100M dataset, which contains more than 100M high-quality grounding samples. Sets a new state of the art in zero-shot settings: 56.0 AP on COCO, 59.8 AP on LVIS-minival, and 52.4 AP on LVIS-val. Supports detection, segmentation, pose estimation, and region captioning.

Outputs 2

DINO-X

model

Vision model for open-world object detection, reporting state-of-the-art zero-shot results on COCO and LVIS. Available in Pro and Edge variants, with support for text, visual, and customized prompts.

Variants

Name | Parameters | Notes
DINO-X Pro | - | SOTA: 56.0 AP on COCO, 59.8 AP on LVIS-minival (zero-shot)
DINO-X Edge | - | Efficient variant for edge deployment

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

paper

Presents DINO-X, together with a universal object prompt and the Grounding-100M dataset, for prompt-free open-world detection and understanding.

arXiv: 2411.14347

vision, open-vocabulary