First 3D-aware reasoning VLM for multi-step spatial referring (NeurIPS 2025). Integrates a depth encoder via supervised fine-tuning (SFT) and uses reinforcement fine-tuning (RFT) with metric-sensitive process reward functions tailored to spatial referring tasks. Includes the RefSpatial dataset (20M QA pairs, 31 spatial relations, up to 5 reasoning steps) and the RefSpatial-Bench benchmark (277 challenging samples). The RFT variant surpasses Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Can be integrated with robots such as the UR5 arm and the G1 humanoid for real-world tasks.

Paper

arXiv: 2506.04308

Venue: NeurIPS 2025

embodied, reasoning, open-weight

Related