SOLAR 10.7B
Introduced Depth Up-Scaling (DUS): take the 32-layer Mistral 7B, duplicate it, remove the final 8 layers from one copy and the first 8 from the other, concatenate the two into a 48-layer model, then continue pretraining. This yields a 10.7B-parameter model (4096 hidden dim, 32 attention heads, 8 KV heads via GQA) that outperformed Mixtral 8x7B-Instruct (47B), Qwen 72B, Llama 2 70B, and Falcon 180B on the H6 benchmark (74.20 avg).
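The duplication-and-slicing step amounts to simple layer surgery on an existing checkpoint. Below is a minimal sketch, assuming the standard Llama/Mistral module layout in Hugging Face transformers (`model.model.layers`, `config.num_hidden_layers`); it illustrates the DUS construction only and omits the continued-pretraining stage that follows, so it should not be read as the authors' released code.

```python
# Minimal Depth Up-Scaling (DUS) sketch: duplicate a 32-layer base model, drop
# the last 8 layers from one copy and the first 8 from the other, concatenate
# into 48 layers. Illustration only; assumes the standard transformers layout.
import copy
import torch
from transformers import AutoModelForCausalLM

def depth_up_scale(base_name: str = "mistralai/Mistral-7B-v0.1", m: int = 8):
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
    n = base.config.num_hidden_layers                 # 32 for Mistral 7B
    layers = base.model.layers                        # nn.ModuleList of decoder blocks

    top = [layers[i] for i in range(n - m)]                    # copy A: layers 0..23
    bottom = [copy.deepcopy(layers[i]) for i in range(m, n)]   # copy B: layers 8..31

    # Concatenate into a 2*(n - m) = 48-layer model and update the config.
    # Depending on the transformers version, per-layer indices used for KV
    # caching may also need re-numbering before further training.
    base.model.layers = torch.nn.ModuleList(top + bottom)
    base.config.num_hidden_layers = 2 * (n - m)
    return base
```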
Alignment used sDPO (stepwise DPO) with easy-to-hard data ordering. MMLU: 66.21, GSM8K: 64.75, TruthfulQA: 71.43, ARC: 71.08, HellaSwag: 88.16. Apache 2.0 license.
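The stepwise schedule behind sDPO can be summarized as: split the preference data into chunks ordered easy-to-hard and run one DPO pass per chunk, with the model aligned in the previous step serving as the reference model for the next. The sketch below assumes that reading; `run_dpo_step` is a hypothetical helper standing in for an actual DPO training loop (for example, TRL's DPOTrainer), not an API from the paper.

```python
# Sketch of a stepwise-DPO (sDPO) schedule. `run_dpo_step` is a hypothetical
# callable that performs one standard DPO pass and returns the updated policy.
def sdpo(sft_model, chunks_easy_to_hard, run_dpo_step):
    policy = sft_model
    reference = sft_model              # step 1 is regularized toward the SFT model
    for chunk in chunks_easy_to_hard:  # chunks ordered from easy to hard
        policy = run_dpo_step(policy=policy, reference=reference, data=chunk)
        reference = policy             # the aligned model becomes the next reference
    return policy
```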
Model Details
Architecture: DENSE
Parameters: 10.7B
Paper: arXiv:2312.15166