RoboRefer / RefSpatial

First 3D-aware reasoning VLM for multi-step spatial referring (NeurIPS 2025). It integrates a depth encoder via supervised fine-tuning (SFT), then applies reinforcement fine-tuning (RFT) with metric-sensitive process reward functions tailored to spatial referring tasks. Includes the RefSpatial dataset (20M QA pairs, 31 spatial relations, reasoning chains of up to 5 steps) and RefSpatial-Bench (277 challenging samples). The RFT variant surpasses Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. It can be integrated with UR5, G1 humanoid, and other robots for real-world tasks.

Paper (arXiv) | GitHub | HuggingFace (RefSpatial-Bench) | NeurIPS 2025

Paper

Venue: NeurIPS 2025

Tags: embodied, reasoning, open-weight
