Aya Vision
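Multilingual multimodal model in 8B and 32B sizes. It combines a SigLIP2 vision encoder and a multimodal adapter with a language-model base: Command R7B for the 8B variant and Aya Expanse 32B for the 32B variant. Covers 23 languages. In the reported evaluations, the 8B model outperforms Qwen-2.5-VL-7B, Pixtral 12B, and Llama-3.2-90B-Vision, and the 32B model outperforms Molmo-72B. License: CC-BY-NC-4.0 (non-commercial).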
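The multimodal adapter maps the vision encoder's patch features into the language model's input space. The sketch below is a simplified illustration, not Aya Vision's actual adapter. It assumes that a 2x2 pixel shuffle (space-to-depth) shrinks the encoder's patch grid and that a learned projection then maps each merged patch into the LLM embedding space. This card does not give the adapter's exact layers, so a single linear layer stands in for the learned MLP. The names `pixel_shuffle`, `linear`, and `adapter_tokens`, along with the grid sizes used, are hypothetical.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # [rows][cols][channels] patch features from the vision encoder


def pixel_shuffle(grid: Grid, factor: int = 2) -> Grid:
    """Space-to-depth: merge each factor x factor block of patch features
    into one vector with factor*factor times as many channels."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    out: Grid = []
    for r in range(0, rows, factor):
        out_row: List[Vector] = []
        for c in range(0, cols, factor):
            merged: Vector = []
            # Concatenate the block's features in row-major order.
            for dr in range(factor):
                for dc in range(factor):
                    merged.extend(grid[r + dr][c + dc])
            out_row.append(merged)
        out.append(out_row)
    return out


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Project x (length d_in) with a weight matrix of shape [d_out][d_in]."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


def adapter_tokens(grid: Grid, weight: List[Vector], factor: int = 2) -> List[Vector]:
    """Hypothetical adapter: pixel-shuffle the grid, flatten it row-major,
    and project each merged patch into the (assumed) LLM embedding size."""
    shuffled = pixel_shuffle(grid, factor)
    return [linear(vec, weight) for row in shuffled for vec in row]
```

With a 2x2 shuffle, the number of image tokens handed to the language model drops by a factor of 4. For example, a hypothetical 26x26 patch grid becomes 169 tokens. This reduction is why a shuffle step of this kind keeps image inputs affordable in the LLM's context window.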
Model Details
Architecture: Dense
Parameters: 32B (largest variant)
Variants
| Name | Parameters | Notes |
|---|---|---|
| Aya Vision 8B | 8B | — |
| Aya Vision 32B | 32B | — |
Paper
arXiv: 2505.08751