Transformer-based family of human-centric vision foundation models from Meta. Sizes range from 0.1B to 5B parameters, with native 1K resolution, hierarchical 4K-capable variants, and windowed attention for extended spatial reasoning. Trained on 1 billion curated human images via masked image reconstruction combined with self-distilled contrastive learning; the 4K models pretrain at 2K output resolution.
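To see why windowed attention matters at 4K, here is a rough token-count sketch. It is illustrative arithmetic only, not the released implementation. It assumes square inputs of 1024 and 4096 pixels, a patch size of 16, and a window of 4096 tokens, none of which are stated above. The function names are hypothetical.

```python
def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens (patch size of 16 is an assumption)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def global_attention_pairs(tokens: int) -> int:
    """Query-key pairs when every token attends to every token."""
    return tokens * tokens


def windowed_attention_pairs(tokens: int, window_tokens: int) -> int:
    """Query-key pairs when each token attends only within its own window."""
    if tokens % window_tokens:
        raise ValueError("token count must be divisible by the window size")
    return tokens * window_tokens


if __name__ == "__main__":
    tokens_1k = num_tokens(1024, 1024)   # assumed "1K" input size
    tokens_4k = num_tokens(4096, 4096)   # assumed "4K" input size
    print("1K tokens:", tokens_1k)
    print("4K tokens:", tokens_4k)
    print("4K global pairs:", global_attention_pairs(tokens_4k))
    # Hypothetical window of 4096 tokens, i.e. a 64x64 token tile.
    print("4K windowed pairs:", windowed_attention_pairs(tokens_4k, 4096))
```

Under these assumptions a 4K input has 16 times as many tokens as a 1K input, so global attention needs 256 times as many query-key pairs. Restricting attention to fixed-size windows makes the cost grow linearly with token count.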

Five task heads are released with open weights: pretrain (feature extractor), pose (keypoint detection), seg (body-part segmentation), normal (surface-normal estimation), and pointmap (depth/geometry). Albedo is discussed in the paper but is not in the public collection. Reported gains over Sapiens v1: +4 mAP on pose, +24.3 mIoU on body-part segmentation, and −45.6% normal angular error. Venue: ICLR 2026.
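For readers unfamiliar with the normal-estimation metric, the sketch below shows one common way to compute a mean angular error between predicted and reference surface normals, and how a relative change such as −45.6% would be derived from two error values. This is a generic formulation with hypothetical function names. The paper's exact evaluation protocol (masking, averaging, units) is not given here and may differ.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def angular_error_deg(pred: Vec3, target: Vec3) -> float:
    """Angle in degrees between two 3D vectors (they need not be unit length)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("normals must be non-zero vectors")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds: Sequence[Vec3], targets: Sequence[Vec3]) -> float:
    """Average per-pixel angular error over paired normals."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally sized, non-empty lists of normals")
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


def relative_change_pct(new: float, old: float) -> float:
    """Signed percentage change from old to new (negative means a reduction)."""
    return 100.0 * (new - old) / old
```

With made-up numbers, `relative_change_pct(5.44, 10.0)` is about −45.6, which is how a figure like the one quoted above would arise from two mean angular errors. The values 5.44 and 10.0 are illustrative only and are not results from the paper.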

By Khirodkar, Wen, Martinez, Dong, Zhaoen, and Saito (Meta Reality Labs).

Model Details

Architecture: Dense
Parameters: 5B

Variants

Name           Parameters  Notes
Sapiens2 0.1B  0.1B        Pretrain only
Sapiens2 0.4B  0.4B        Pretrain + pose + seg + normal + pointmap heads
Sapiens2 0.8B  0.8B        Pretrain + pose + seg + normal + pointmap heads
Sapiens2 1B    1B          Pretrain + pose + seg + normal + pointmap heads
Sapiens2 5B    5B          Pretrain + pose + seg + normal + pointmap heads; 4K-capable hierarchical variant

Paper

Venue: ICLR 2026
Authors: Rawal Khirodkar · He Wen · Julieta Martinez · Yuan Dong · Su Zhaoen · Shunsuke Saito
Tags: vision · open-weight · foundational · human-centric