A 3B-parameter MLLM that redefines object detection and visual perception as next-token prediction, unifying detection, OCR, pointing, keypointing, and visual prompting. Trained on 22M samples with GRPO-based reinforcement post-training; achieves zero-shot performance comparable to or exceeding regression-based models such as DINO and Grounding DINO.

Outputs 2

Rex-Omni

model

A 3B MLLM for unified visual perception via next-point prediction. Supports detection, OCR, pointing, keypointing, and visual prompting.

Parameters 3B

Detect Anything via Next Point Prediction

paper

Presents the Rex-Omni framework, which redefines visual perception tasks as next-token prediction, trained in two stages with GRPO-based reinforcement post-training.
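To make "detection as next-token prediction" concrete, here is a minimal sketch of how coordinate tokens emitted by such a model could be decoded back into pixel-space boxes. The token format (quantized coordinates in a fixed number of bins, four tokens per box) is an assumption for illustration, not the paper's actual scheme; `decode_boxes` and its parameters are hypothetical names.

```python
# Hypothetical sketch: decoding next-token detection output.
# Assumes coordinates are emitted as quantized token ids in [0, n_bins - 1],
# four per box (x0, y0, x1, y1) — format assumed for illustration only.

def decode_boxes(coord_tokens, img_w, img_h, n_bins=1000):
    """Map a flat [x0, y0, x1, y1, ...] token-id sequence to pixel boxes."""
    if len(coord_tokens) % 4 != 0:
        raise ValueError("expected groups of 4 coordinate tokens per box")
    boxes = []
    for i in range(0, len(coord_tokens), 4):
        x0, y0, x1, y1 = coord_tokens[i:i + 4]
        boxes.append((
            x0 / (n_bins - 1) * img_w,
            y0 / (n_bins - 1) * img_h,
            x1 / (n_bins - 1) * img_w,
            y1 / (n_bins - 1) * img_h,
        ))
    return boxes

# A box spanning the whole image decodes to its full pixel extent.
print(decode_boxes([0, 0, 999, 999], 640, 480))  # → [(0.0, 0.0, 640.0, 480.0)]
```

Because every coordinate is just another token in the vocabulary, the same decoding idea extends to points (2 tokens) and keypoints (2 tokens per joint) without changing the model head.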

arXiv: 2510.12798

Venue: CVPR 2026

vision · multimodal · open-source