Visual-conditional multi-view diffusion model trained on large-scale internet videos for open-world 3D creation (CVPR 2025 Highlight, ~3% acceptance rate). Introduces a warping-based 3D generation pipeline that works without camera parameters, using visual conditioning to produce camera-controllable, geometrically consistent multi-view images. Supports text-to-3D, single-view-to-3D, sparse-view-to-3D, 3D editing, and Gaussian rendering. Trained on the WebVi3D dataset (320M frames from 16M video clips).
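As a rough illustration of what a warping-derived visual condition can look like, the toy sketch below forward-warps an image using a per-pixel disparity map and returns the warped pixels together with a mask of the holes a generative model would need to fill. It is not the model's actual pipeline: the function name `forward_warp`, the `disparity` input, and the scalar `shift` are hypothetical, and a purely horizontal shift stands in for real view changes.

```python
import numpy as np

def forward_warp(image, disparity, shift):
    """Toy forward warp: move each pixel horizontally by shift * disparity.

    Assumption for illustration only: larger disparity means a nearer surface,
    so nearer pixels win when two sources land on the same target.
    Returns (warped image, boolean mask of target pixels that received a value).
    """
    h, w = disparity.shape
    warped = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    best = np.full((h, w), -np.inf)  # z-buffer keyed on disparity

    ys, xs = np.mgrid[0:h, 0:w]
    new_xs = np.rint(xs + shift * disparity).astype(int)
    valid = (new_xs >= 0) & (new_xs < w)  # drop pixels warped out of frame

    for y, x, nx in zip(ys[valid], xs[valid], new_xs[valid]):
        d = disparity[y, x]
        if d > best[y, nx]:
            best[y, nx] = d
            warped[y, nx] = image[y, x]
            filled[y, nx] = True
    return warped, filled


if __name__ == "__main__":
    image = np.array([[10, 20, 30, 40]])
    disparity = np.array([[0.0, 1.0, 0.0, 0.0]])
    warped, filled = forward_warp(image, disparity, shift=1)
    print(warped)    # warped pixels; unfilled positions stay 0
    print(~filled)   # holes left for a generative model to complete
```

In this sketch the pair `(warped, filled)` plays the role of a visual condition: the known pixels constrain the new view, and the unfilled region marks where content must be generated.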

Paper

Venue: CVPR 2025 Highlight

3d-generation, multimodal, open-weight