SOLAR 10.7B
Introduced Depth Up-Scaling (DUS): take the 32-layer Mistral 7B, duplicate it, remove the final 8 layers from one copy and the first 8 from the other, concatenate the two into a 48-layer model, then continue pretraining. This yields a 10.7B-parameter model (4096 hidden dim, 32 attention heads, 8 KV heads via GQA) that outperformed Mixtral 8x7B-Instruct (47B), Qwen 72B, Llama 2 70B, and Falcon 180B on the H6 benchmark (74.20 avg).
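The duplication-and-slicing step amounts to simple layer surgery on an existing checkpoint. Below is a minimal sketch, assuming the standard Llama/Mistral module layout in Hugging Face transformers (`model.model.layers`, `config.num_hidden_layers`); it illustrates the DUS construction only and omits the continued-pretraining stage that follows, so it should not be read as the authors' released code.

```python
# Minimal Depth Up-Scaling (DUS) sketch: duplicate a 32-layer base model, drop
# the last 8 layers from one copy and the first 8 from the other, concatenate
# into 48 layers. Illustration only; assumes the standard transformers layout.
import copy
import torch
from transformers import AutoModelForCausalLM

def depth_up_scale(base_name: str = "mistralai/Mistral-7B-v0.1", m: int = 8):
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
    n = base.config.num_hidden_layers                 # 32 for Mistral 7B
    layers = base.model.layers                        # nn.ModuleList of decoder blocks

    top = [layers[i] for i in range(n - m)]                    # copy A: layers 0..23
    bottom = [copy.deepcopy(layers[i]) for i in range(m, n)]   # copy B: layers 8..31

    # Concatenate into a 2*(n - m) = 48-layer model and update the config.
    # Depending on the transformers version, per-layer indices used for KV
    # caching may also need re-numbering before further training.
    base.model.layers = torch.nn.ModuleList(top + bottom)
    base.config.num_hidden_layers = 2 * (n - m)
    return base
```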
Alignment used sDPO (stepwise DPO) with easy-to-hard data ordering. MMLU: 66.21, GSM8K: 64.75, TruthfulQA: 71.43, ARC: 71.08, HellaSwag: 88.16. Apache 2.0 license.
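The stepwise schedule behind sDPO can be summarized as: split the preference data into chunks ordered easy-to-hard and run one DPO pass per chunk, with the model aligned in the previous step serving as the reference model for the next. The sketch below assumes that reading; `run_dpo_step` is a hypothetical helper standing in for an actual DPO training loop (for example, TRL's DPOTrainer), not an API from the paper.

```python
# Sketch of a stepwise-DPO (sDPO) schedule. `run_dpo_step` is a hypothetical
# callable that performs one standard DPO pass and returns the updated policy.
def sdpo(sft_model, chunks_easy_to_hard, run_dpo_step):
    policy = sft_model
    reference = sft_model              # step 1 is regularized toward the SFT model
    for chunk in chunks_easy_to_hard:  # chunks ordered from easy to hard
        policy = run_dpo_step(policy=policy, reference=reference, data=chunk)
        reference = policy             # the aligned model becomes the next reference
    return policy
```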
Model Details
Architecture: DENSE
Parameters: 10.7B
Paper: arXiv:2312.15166