DINO-X

A unified vision model for open-world object detection and understanding. It accepts text, visual, and customized prompts, and also supports prompt-free universal object detection. It is trained on the Grounding-100M dataset, which contains more than 100M high-quality grounding samples. It reports new state-of-the-art zero-shot results: 56.0 AP on COCO, 59.8 AP on LVIS-minival, and 52.4 AP on LVIS-val. Supported tasks include detection, segmentation, pose estimation, and region captioning.

Paper (arXiv) | GitHub (API)

Outputs 2

model

Open-world object detection model reporting state-of-the-art zero-shot results. Available in Pro and Edge variants, with text, visual, and customized prompt support.

GitHub (API)

Variants

Name	Parameters	Notes
DINO-X Pro	—	SOTA 56.0 AP COCO, 59.8 AP LVIS-minival zero-shot
DINO-X Edge	—	Efficient variant for edge deployment
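
The AP figures quoted for DINO-X Pro are box average precision scores. COCO-style AP is averaged over intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, so each number ultimately rests on how well predicted boxes overlap ground-truth boxes. The following is a minimal sketch of that overlap test only, not code from DINO-X or its API; the function names and the (x_min, y_min, x_max, y_max) box format are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty)
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_thresholds(pred: Box, gt: Box) -> List[float]:
    """Hypothetical helper: thresholds at which `pred` counts as a match for `gt`."""
    iou = box_iou(pred, gt)
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]
```

A full AP computation additionally ranks predictions by confidence and integrates a precision-recall curve per category; that part is omitted here.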

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

paper

Presents DINO-X, introducing a universal object prompt and the Grounding-100M dataset to enable prompt-free open-world detection and understanding.

Paper (arXiv)

Citations 4

arXiv HTML

vision, open-vocabulary
