Sapiens2 | Lab Index

Transformer-based family of human-centric vision foundation models from Meta, spanning 0.1B–5B parameters. Native 1K resolution, with hierarchical 4K-capable variants and windowed attention for extended spatial reasoning. Pretrained on 1 billion curated human images via masked image reconstruction combined with self-distilled contrastive learning; the 4K models pretrain at 2K output resolution.
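As a rough illustration of the masked image reconstruction objective mentioned above, the sketch below hides a random subset of image patches and scores a reconstruction only on the hidden ones. It is a generic masked-autoencoder-style example, not the Sapiens2 training code: the patch size (16), the mask ratio (0.75), and the function names are assumptions made for illustration, and the self-distilled contrastive term is not shown.

```python
import numpy as np


def random_patch_mask(image_size=1024, patch_size=16, mask_ratio=0.75, seed=0):
    """Return a (side, side) boolean grid; True marks a patch hidden from the encoder.

    patch_size and mask_ratio are illustrative assumptions, not values
    taken from the Sapiens2 paper.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    num_patches = side * side
    num_masked = int(round(mask_ratio * num_patches))
    rng = np.random.default_rng(seed)
    # Sample masked patch indices without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    return mask.reshape(side, side)


def masked_reconstruction_loss(target, prediction, mask, patch_size=16):
    """Mean squared error over masked patches only; images are (H, W, C) arrays."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    squared_error = (prediction - target) ** 2
    return float(squared_error[pixel_mask].mean())
```

Under these assumptions a 1024×1024 input yields a 64×64 grid of 4,096 patches, of which 3,072 are masked; visible patches do not contribute to the loss.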

Five checkpoint types are openly released: pretrain (feature extractor), pose (keypoint detection), seg (body-part segmentation), normal (surface-normal estimation), and pointmap (depth/geometry). Albedo is discussed in the paper but is not in the public collection. Reported gains over Sapiens v1: +4 mAP on pose, +24.3 mIoU on body-part segmentation, and −45.6% normal angular error. ICLR 2026.

By Khirodkar, Wen, Martinez, Dong, Zhaoen, and Saito (Meta Reality Labs).

Paper (arXiv) · HuggingFace collection

Model Details

Architecture: Dense

Parameters: 5B (largest variant; the family spans 0.1B–5B)

Variants

Name	Parameters	Notes
Sapiens2 0.1B	0.1B	Pretrain only
Sapiens2 0.4B	0.4B	Pretrain + pose + seg + normal + pointmap heads
Sapiens2 0.8B	0.8B	Pretrain + pose + seg + normal + pointmap heads
Sapiens2 1B	1B	Pretrain + pose + seg + normal + pointmap heads
Sapiens2 5B	5B	Pretrain + pose + seg + normal + pointmap heads; 4K-capable hierarchical variant
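The variant table can also be held as a small lookup structure, for example to pick the largest variant that ships a given head within a parameter budget. This is a minimal sketch: the `Variant` class and `largest_variant` helper are hypothetical names introduced here, and the strings are the display names from the table, not HuggingFace repository identifiers.

```python
from dataclasses import dataclass

# Heads listed for the 0.4B-5B variants in the table above.
ALL_HEADS = frozenset({"pretrain", "pose", "seg", "normal", "pointmap"})


@dataclass(frozen=True)
class Variant:
    name: str          # display name from the table, not a repo id
    params_b: float    # parameter count in billions
    heads: frozenset   # task heads released for this variant


VARIANTS = (
    Variant("Sapiens2 0.1B", 0.1, frozenset({"pretrain"})),
    Variant("Sapiens2 0.4B", 0.4, ALL_HEADS),
    Variant("Sapiens2 0.8B", 0.8, ALL_HEADS),
    Variant("Sapiens2 1B", 1.0, ALL_HEADS),
    Variant("Sapiens2 5B", 5.0, ALL_HEADS),
)


def largest_variant(head, max_params_b):
    """Return the largest variant with `head` at or under the budget, else None."""
    candidates = [
        v for v in VARIANTS if head in v.heads and v.params_b <= max_params_b
    ]
    return max(candidates, key=lambda v: v.params_b) if candidates else None
```

For instance, `largest_variant("pose", 0.9)` selects the 0.8B entry, while a 0.2B budget returns `None` for pose because the 0.1B variant is pretrain only.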

Paper

Venue: ICLR 2026

arXiv · HTML

Authors: Rawal Khirodkar · He Wen · Julieta Martinez · Yuan Dong · Su Zhaoen · Shunsuke Saito

vision · open-weight · foundational · human-centric
