Rex-Omni
model · paper · A 3B-parameter MLLM that redefines object detection and visual perception as next-token prediction. Unifies detection, OCR, pointing, keypointing, and visual prompting. Trained on 22M samples with GRPO-based reinforcement post-training. Achieves performance comparable to or exceeding regression-based models such as DINO and Grounding DINO in zero-shot settings.
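The core idea above, treating box coordinates as tokens a language model emits one at a time, can be sketched with a simple coordinate quantizer. This is a hedged illustration, not the released Rex-Omni code: the function names, the 1000-bin vocabulary, and the (x0, y0, x1, y1) token order are assumptions for the sake of the example.

```python
# Sketch (assumed, not Rex-Omni's actual implementation): casting a bounding
# box as discrete coordinate tokens so a model can predict it token by token.

def box_to_tokens(box, img_w, img_h, num_bins=1000):
    """Quantize (x0, y0, x1, y1) pixel coordinates into integer bin tokens."""
    x0, y0, x1, y1 = box
    scale = lambda v, size: min(num_bins - 1, int(v / size * num_bins))
    return [scale(x0, img_w), scale(y0, img_h),
            scale(x1, img_w), scale(y1, img_h)]

def tokens_to_box(tokens, img_w, img_h, num_bins=1000):
    """Invert the quantization back to approximate pixel coordinates."""
    x0, y0, x1, y1 = tokens
    unscale = lambda t, size: (t + 0.5) / num_bins * size
    return (unscale(x0, img_w), unscale(y0, img_h),
            unscale(x1, img_w), unscale(y1, img_h))

tokens = box_to_tokens((48, 96, 320, 512), img_w=640, img_h=640)
print(tokens)  # → [75, 150, 500, 800]
```

Round-tripping through `tokens_to_box` recovers the box to within one bin width, which is the precision/vocabulary-size trade-off any next-token formulation of localization has to make.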
Rex-Omni
model · A 3B MLLM for unified visual perception via next-point prediction. Supports detection, OCR, pointing, keypointing, and visual prompting.
Parameters: 3B
Detect Anything via Next Point Prediction
paper · Presents the Rex-Omni framework, which redefines visual perception tasks as next-token prediction, using two-stage training with GRPO post-training.
arXiv: 2510.12798
Venue: CVPR 2026