Mistral's debut model and a landmark in efficient open-weight LLMs. 7.3B dense parameters with two key innovations: Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) for efficient long-context handling. 32K context.
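Grouped-Query Attention shares each key/value head across several query heads, shrinking the KV cache and speeding decoding. A minimal numpy sketch, assuming illustrative head counts (8 query heads, 2 KV heads — not Mistral's actual configuration):

```python
# Sketch of Grouped-Query Attention (GQA): consecutive query heads
# share one key/value head. Head counts here are illustrative only.
import numpy as np

def gqa(q, k, v, n_q_heads, n_kv_heads):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so a group of query heads attends to it.
    k = np.repeat(k, group, axis=0)           # -> (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # softmax over key positions
    return w @ v                              # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
seq, d = 4, 16
out = gqa(rng.normal(size=(8, seq, d)),
          rng.normal(size=(2, seq, d)),
          rng.normal(size=(2, seq, d)), 8, 2)
print(out.shape)  # (8, 4, 16)
```

With 2 KV heads instead of 8, the KV cache is 4x smaller while query capacity is unchanged.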

Outperformed Llama 2 13B on all benchmarks and Llama 1 34B on reasoning, math, and code. MMLU: 60.1%, HellaSwag: 84.0%. Apache 2.0. Spawned an enormous ecosystem of fine-tunes and derivatives across the open-source community.
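The Sliding Window Attention mentioned above restricts each token to attending over a fixed-size window of recent tokens, making attention cost linear in sequence length per token. A sketch of the banded mask, with a toy window of 4 (the paper uses a 4096-token window):

```python
# Sliding-window attention mask sketch: token i attends only to the
# `window` most recent tokens (itself included). Window=4 is a toy value.
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    # Causal AND within the last `window` positions.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 4)
print(mask.astype(int))
```

Stacking such layers lets information propagate beyond the window: after k layers, a token's receptive field spans roughly k × window positions.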

Model Details

Architecture DENSE
Parameters 7.3B
Context window 32K

Paper

arXiv: 2310.06825

open-weight · efficiency
