Falcon Mamba
First competitive pure-Mamba (attention-free) 7B LLM. 64 layers, 4096 hidden dimension, trained on 5.8T tokens on 256 H100 GPUs. Progressive context-length training (2K→8K) and constant-memory inference beyond 130K tokens. Demonstrates that SSMs can match Transformer quality at the 7B scale.
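The constant-memory property comes from the state-space recurrence: each step updates a fixed-size hidden state, so memory does not grow with sequence length the way a Transformer KV cache does. A minimal sketch with a toy linear SSM (the real Mamba layer uses input-dependent, discretized parameters; the matrices below are illustrative placeholders):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    The only carried memory is h, whose size is fixed (d_state),
    independent of how many tokens have been processed.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # fixed-size state update
        ys.append(C @ h)      # readout from current state
    return np.array(ys)

rng = np.random.default_rng(0)
d_state = 16
A = 0.9 * np.eye(d_state)          # toy stable transition (placeholder, not Mamba's)
B = rng.normal(size=d_state)
C = rng.normal(size=d_state)

y_short = ssm_scan(rng.normal(size=128), A, B, C)
y_long = ssm_scan(rng.normal(size=4096), A, B, C)
# In both runs the carried state is d_state floats; only the
# output array scales with sequence length.
```

This is why a pure-Mamba model can decode at 130K+ tokens without the memory footprint growing per token.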
Model Details
Architecture DENSE
Parameters 7.27B
Context window 8,192
Paper
arXiv: 2410.05355