Aya Vision is a multilingual multimodal model family (8B and 32B) that combines a Command R7B language-model base with a SigLIP2 vision encoder through a multimodal adapter. It supports 23 languages. The 8B model beats Qwen2.5-VL-7B, Pixtral 12B, and Llama-3.2-90B-Vision; the 32B model outperforms Molmo-72B. Released under CC-BY-NC-4.0.
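The multimodal adapter mentioned above is the standard way to bridge a vision encoder and a language model: patch embeddings from the vision encoder (here SigLIP2) are projected into the language model's embedding space so image tokens can be interleaved with text tokens. A minimal sketch of such a connector follows; the two-layer MLP design and all dimensions are illustrative assumptions, not Aya Vision's actual configuration.

```python
import numpy as np

def init_adapter(d_vision, d_lm, d_hidden, seed=0):
    # Two-layer MLP connector (LLaVA-style); weights here are random
    # placeholders standing in for trained parameters.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((d_vision, d_hidden)) * 0.02,
        "b1": np.zeros(d_hidden),
        "W2": rng.standard_normal((d_hidden, d_lm)) * 0.02,
        "b2": np.zeros(d_lm),
    }

def project_patches(patches, adapter):
    # patches: (num_patches, d_vision) output of the vision encoder.
    h = np.maximum(patches @ adapter["W1"] + adapter["b1"], 0.0)  # ReLU
    # Result: (num_patches, d_lm) "image tokens" for the language model.
    return h @ adapter["W2"] + adapter["b2"]

# Example: 729 patches of dim 1152 mapped into a 4096-dim LM embedding
# space (sizes chosen for illustration only).
adapter = init_adapter(d_vision=1152, d_lm=4096, d_hidden=4096)
image_tokens = project_patches(np.zeros((729, 1152)), adapter)
```

After projection, the image tokens are concatenated with the text token embeddings and the language model attends over the combined sequence.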

Model Details

Architecture: Dense
Parameters: 8B and 32B (two variants)

Variants

Name            Parameters
Aya Vision 8B   8B
Aya Vision 32B  32B

Paper

arXiv: 2505.08751

multimodal · multilingual · open-weight · vision
