A 3B-parameter MLLM that redefines object detection and visual perception as next-token prediction, unifying detection, OCR, pointing, keypointing, and visual prompting. Trained on 22M samples with GRPO-based reinforcement post-training; achieves zero-shot performance comparable to or exceeding regression-based models such as DINO and Grounding DINO.

Outputs 2

Rex-Omni

model

A 3B MLLM for unified visual perception via next-point prediction. Supports detection, OCR, pointing, keypointing, and visual prompting.

Parameters 3B

Detect Anything via Next Point Prediction

paper

Presents the Rex-Omni framework, which redefines visual perception tasks as next-token prediction, trained in two stages with GRPO-based reinforcement post-training.
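To make "detection as next-token prediction" concrete, here is a minimal sketch of how coordinate tokens emitted by such a model could be decoded back into pixel-space boxes. The token format (quantized coordinates in a fixed number of bins, four tokens per box) is an assumption for illustration, not the paper's actual scheme; `decode_boxes` and its parameters are hypothetical names.

```python
# Hypothetical sketch: decoding next-token detection output.
# Assumes coordinates are emitted as quantized token ids in [0, n_bins - 1],
# four per box (x0, y0, x1, y1) — format assumed for illustration only.

def decode_boxes(coord_tokens, img_w, img_h, n_bins=1000):
    """Map a flat [x0, y0, x1, y1, ...] token-id sequence to pixel boxes."""
    if len(coord_tokens) % 4 != 0:
        raise ValueError("expected groups of 4 coordinate tokens per box")
    boxes = []
    for i in range(0, len(coord_tokens), 4):
        x0, y0, x1, y1 = coord_tokens[i:i + 4]
        boxes.append((
            x0 / (n_bins - 1) * img_w,
            y0 / (n_bins - 1) * img_h,
            x1 / (n_bins - 1) * img_w,
            y1 / (n_bins - 1) * img_h,
        ))
    return boxes

# A box spanning the whole image decodes to its full pixel extent.
print(decode_boxes([0, 0, 999, 999], 640, 480))  # → [(0.0, 0.0, 640.0, 480.0)]
```

Because every coordinate is just another token in the vocabulary, the same decoding idea extends to points (2 tokens) and keypoints (2 tokens per joint) without changing the model head.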

arXiv: 2510.12798

Venue: CVPR 2026

vision · multimodal · open-source