RoboRefer / RefSpatial
model
First 3D-aware reasoning VLM for multi-step spatial referring (NeurIPS 2025). It integrates a depth encoder via supervised fine-tuning (SFT) and applies reinforcement fine-tuning (RFT) with metric-sensitive process reward functions for spatial tasks. The release includes the RefSpatial dataset (20M QA pairs, 31 spatial relations, up to 5-step reasoning) and RefSpatial-Bench (277 challenging samples). The RFT variant surpasses Gemini-2.5-Pro by 17.4% on RefSpatial-Bench. The model can be integrated with UR5, G1 humanoid, and other robots for real-world tasks.
Paper
arXiv: 2506.04308
Venue: NeurIPS 2025