Cohere's current flagship: a 111B-parameter dense Transformer with a novel hybrid attention pattern that interleaves 3 sliding-window attention layers (4,096-token window, with RoPE) with 1 global attention layer that uses no positional embeddings (NoPE). 256K context, 23 languages. Relies on merging domain-expert models rather than a mixture-of-experts architecture to achieve expert-level breadth. The layer interleaving is illustrated in the sketch below.
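
A minimal illustrative sketch (not Cohere's implementation) of the 3:1 interleaving of sliding-window and global attention layers described above. The window size, 3:1 ratio, and RoPE/NoPE split follow the description; the function names and mask construction are assumptions for illustration.

```python
import torch

WINDOW = 4096          # sliding-window size per the description above
PATTERN_PERIOD = 4     # 3 sliding-window layers followed by 1 global layer


def layer_kind(layer_idx: int) -> str:
    """Attention type for a given layer index (illustrative naming)."""
    # Layers 0, 1, 2 -> sliding window with RoPE; layer 3 -> global with NoPE; repeat.
    return "global_nope" if (layer_idx + 1) % PATTERN_PERIOD == 0 else "sliding_rope"


def attention_mask(seq_len: int, kind: str) -> torch.Tensor:
    """Boolean mask where True marks key positions each query may attend to."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    if kind == "sliding_rope":
        return causal & (q - k < WINDOW)     # causal, restricted to the local window
    return causal                            # global layer: full causal attention


if __name__ == "__main__":
    print([layer_kind(i) for i in range(8)])
    # ['sliding_rope', 'sliding_rope', 'sliding_rope', 'global_nope',
    #  'sliding_rope', 'sliding_rope', 'sliding_rope', 'global_nope']
```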

150% higher throughput than Command R+ 08-2024; runs on just 2 GPUs (A100 or H100). Trained with a decentralized approach using self-refinement algorithms. AA Intelligence Index: 13. Weights released under CC-BY-NC. Command A Vision (Jul 2025) adds a SigLIP2 vision encoder.
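
A minimal sketch of loading the open weights with Hugging Face transformers and sharding them across the available GPUs via device_map="auto". The repository id and precision choice are assumptions, not taken from the source; check the actual model listing before use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereLabs/c4ai-command-a-03-2025"  # assumed repo id; verify on the hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed precision; quantization may be needed per GPU budget
    device_map="auto",            # shard layers across however many GPUs are visible
)

messages = [{"role": "user", "content": "Summarize this contract clause in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```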

Model Details

Architecture: Dense
Parameters: 111B
Context window: 256,000 tokens

Paper

arXiv: 2504.00698

open-weight · enterprise · multilingual · frontier
