Trinity Large
398B total / 13B active parameter MoE (256 routed + 1 shared expert, top-4 routing, 1.56% routed-expert activation ratio). 66 transformer layers (6 dense + 60 MoE), 3072 model dim, 48 query heads, 8 KV heads. Interleaved local (4096-token window) and global attention with gated attention and depth-scaled sandwich norm. Sigmoid routing with SMEBU load balancing.
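The routing scheme above can be sketched in a few lines. This is a minimal illustration, not Trinity's implementation: function names and shapes are assumptions, and the renormalization of selected sigmoid gates to sum to 1 is a common convention that the card does not actually specify. It also checks the quoted 1.56% ratio, which is 4 active routed experts out of 256.

```python
import numpy as np

def route_tokens(hidden, gate_w, top_k=4):
    """Sketch of sigmoid top-k expert routing (names/shapes assumed).

    hidden: (tokens, d_model); gate_w: (d_model, n_experts).
    Unlike softmax routing, each expert's sigmoid score is independent;
    the top-k selected scores are renormalized here (assumed convention).
    The shared expert is always active, so it bypasses this router.
    """
    scores = 1.0 / (1.0 + np.exp(-hidden @ gate_w))       # sigmoid gate scores
    idx = np.argsort(-scores, axis=-1)[:, :top_k]         # top-4 routed experts
    gates = np.take_along_axis(scores, idx, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)     # renormalize weights
    return idx, gates

rng = np.random.default_rng(0)
n_experts, d_model, tokens = 256, 64, 8
idx, gates = route_tokens(rng.standard_normal((tokens, d_model)),
                          rng.standard_normal((d_model, n_experts)) * 0.02)
print(idx.shape, gates.shape)        # (8, 4) (8, 4)
print(round(4 / n_experts * 100, 2))  # 1.56 -> routed-expert activation ratio
```

The SMEBU load-balancing term and the shared expert's output combination are omitted; this only shows how top-4 sigmoid routing selects and weights experts.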
Trained on 17T tokens in 3 phases on 2,048 NVIDIA B300 GPUs in ~33 days (~$20M total across all Trinity models). Muon optimizer; zero loss spikes. 512K native context (tested to 1M). Data curated by DatologyAI (8T+ synthetic tokens).
MMLU: 87.2 (instruct), MMLU-Pro: 75.3, GPQA-Diamond: 63.3. Trinity-Large-Thinking: tau2-Airline 88.0%, AIME25 96.3%, SWE-Bench 63.2%, PinchBench 91.9% (#2 globally). #1 most-used open model in the US on OpenRouter. Apache 2.0.
Paper: arXiv:2602.17004