First competitive pure-Mamba (attention-free) 7B LLM. 64 layers, 4096 hidden dim, trained on 5.8T tokens using 256 H100s. Progressive context (2K→8K training), constant-memory inference at 130K+ tokens. Demonstrates SSMs can match Transformer quality at the 7B scale.
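The constant-memory claim follows from the recurrent form of a state-space model: each token updates a fixed-size hidden state, so memory does not grow with sequence length as a Transformer's KV cache does. A minimal sketch with toy dimensions (not the model's actual kernels or parameter shapes, which are selective and learned):

```python
import numpy as np

# Toy diagonal linear SSM recurrence illustrating O(1)-per-token memory.
# Dimensions and matrices are illustrative, not Falcon's.
d_state, d_model = 16, 8
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 0.99, size=d_state)   # diagonal state transition (stable)
B = rng.normal(size=(d_state, d_model))    # input projection
C = rng.normal(size=(d_model, d_state))    # output projection

def step(x, h):
    h = A * h + B @ x                      # update fixed-size hidden state
    y = C @ h                              # emit output for this token
    return y, h

# Stream arbitrarily many tokens; only h (d_state floats) persists,
# so memory stays constant rather than scaling with sequence length.
h = np.zeros(d_state)
for t in range(10_000):
    x = rng.normal(size=d_model)
    _, h = step(x, h)

print(h.shape)
```

In contrast, attention-based inference stores keys and values for every past token, so its memory grows linearly with context; this is the practical difference behind the 130K+ token figure above.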

Model Details

Architecture: dense
Parameters: 7.27B
Context window: 8,192 tokens

Paper

arXiv: 2410.05355

Tags: open-weight, architecture
