A 100B-parameter dense Transformer with QK normalization and z-loss for training stability. Trained on 2T tokens (1.3T English, 0.7T Japanese) in two phases on NVIDIA H100 GPUs with FP8 precision. Funded under Japan's GENIAC/NEDO program.
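QK normalization and z-loss are the two stability techniques named above. The sketch below shows one common formulation, not necessarily the paper's exact variant: queries and keys are RMS-normalized (here without the usual learned gain) before the attention dot product, which bounds the attention logits, and an auxiliary penalty on the squared log-normalizer of the output logits keeps them from drifting. The z-loss weight of 1e-4 follows the PaLM convention; all shapes and constants here are assumptions.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps=1e-6):
    """Attention with RMS-normalized queries and keys; q, k, v: [B, H, T, D]."""
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)  # RMSNorm (no learned gain)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5        # bounded attention logits
    return F.softmax(scores, dim=-1) @ v

def z_loss(logits, weight=1e-4):
    """Auxiliary loss: weight * mean((log Z)^2), where Z is the softmax normalizer."""
    log_z = torch.logsumexp(logits, dim=-1)
    return weight * log_z.pow(2).mean()

# Usage (assumed): add z_loss to the standard LM objective, e.g.
#   total = F.cross_entropy(logits.view(-1, vocab), targets.view(-1)) + z_loss(logits)
```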

Beats GPT-4 on Japanese benchmarks: Jaster 0-shot average 0.738 (vs GPT-4's 0.722) and 4-shot average 0.775 (vs 0.772); Japanese MT-Bench score 7.78.

Model Details

Architecture: Dense
Parameters: 100B

Paper

arXiv: 2410.07563 (PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency)

Tags: open-weight, multilingual
