Sapiens2
Transformer-based family of human-centric vision foundation models from Meta. Models range from 0.1B to 5B parameters, with native 1K resolution, hierarchical 4K-capable variants, and windowed attention for extended spatial reasoning. Pretrained on 1 billion curated human images via masked image reconstruction combined with self-distilled contrastive learning; the 4K models pretrain at 2K output resolution.
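To make the masked image reconstruction objective more concrete, here is a minimal sketch of how an image might be split into patches and a random subset hidden from the encoder. The 16-pixel patch size, the 75% mask ratio, and the helper names `patch_grid` and `random_mask` are assumptions chosen for illustration; the description above does not state the values Sapiens2 actually uses.

```python
import random

def patch_grid(height, width, patch):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch

def random_mask(num_tokens, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    num_masked = int(num_tokens * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

# Assumed values for illustration only: 1K input, 16-pixel patches, 75% masking.
rows, cols = patch_grid(1024, 1024, patch=16)
mask = random_mask(rows * cols, mask_ratio=0.75)
print(f"{rows * cols} patches, {sum(mask)} masked")
```

Under these assumed settings a 1K image yields a 64 × 64 patch grid, and the reconstruction loss would be computed on the masked patches only.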
Five task heads are released openly: pretrain (feature extractor), pose (keypoint detection), seg (body-part segmentation), normal (surface-normal estimation), and pointmap (depth/geometry). Albedo is discussed in the paper but is not in the public collection. Reported gains over Sapiens v1: +4 mAP on pose, +24.3 mIoU on body-part segmentation, and 45.6% lower normal angular error. ICLR 2026.
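For readers unfamiliar with the normal angular error metric, the sketch below shows one common way to compute it: the angle between predicted and ground-truth normal vectors, averaged over pixels. It assumes this usual definition and reads the reported percentage as a relative change; the exact evaluation protocol is not given here, and the function names are hypothetical.

```python
import math

def angular_error_deg(pred, target):
    """Angle in degrees between two nonzero 3D normal vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def mean_angular_error_deg(preds, targets):
    """Average per-pixel angular error over paired lists of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)

def relative_change_pct(old, new):
    """Relative change from old to new, in percent (negative means lower)."""
    return 100.0 * (new - old) / old

# Toy vectors, not real model outputs.
preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
targets = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
print(mean_angular_error_deg(preds, targets))
```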
By Khirodkar, Wen, Martinez, Dong, Zhaoen, and Saito (Meta Reality Labs).
Model Details
Variants
| Name | Parameters | Notes |
|---|---|---|
| Sapiens2 0.1B | 0.1B | Pretrain only |
| Sapiens2 0.4B | 0.4B | Pretrain + pose + seg + normal + pointmap heads |
| Sapiens2 0.8B | 0.8B | Pretrain + pose + seg + normal + pointmap heads |
| Sapiens2 1B | 1B | Pretrain + pose + seg + normal + pointmap heads |
| Sapiens2 5B | 5B | Pretrain + pose + seg + normal + pointmap heads; 4K-capable hierarchical variant |
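The variant table can also be expressed as a small lookup structure, which is handy when scripting over the collection. This is an illustrative sketch only: the `Variant` class, the `VARIANTS` tuple, and `variants_with_head` are hypothetical names, not part of any released Sapiens2 API, and the entries simply mirror the table.

```python
from dataclasses import dataclass

ALL_HEADS = ("pretrain", "pose", "seg", "normal", "pointmap")

@dataclass(frozen=True)
class Variant:
    name: str
    params_billions: float
    heads: tuple
    hierarchical_4k: bool = False

# Entries transcribed from the table above.
VARIANTS = (
    Variant("Sapiens2 0.1B", 0.1, ("pretrain",)),
    Variant("Sapiens2 0.4B", 0.4, ALL_HEADS),
    Variant("Sapiens2 0.8B", 0.8, ALL_HEADS),
    Variant("Sapiens2 1B", 1.0, ALL_HEADS),
    Variant("Sapiens2 5B", 5.0, ALL_HEADS, hierarchical_4k=True),
)

def variants_with_head(head):
    """Return the names of variants that list the given head."""
    if head not in ALL_HEADS:
        raise ValueError(f"unknown head: {head!r}")
    return [v.name for v in VARIANTS if head in v.heads]

print(variants_with_head("pose"))
```

Querying for `"pose"` skips the 0.1B model, which is listed as pretrain only.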