"Training Compute-Optimal Large Language Models." Found that existing LLMs were significantly undertrained: for compute-optimal training, model size and training tokens should scale equally. Chinchilla (70B, 1.4T tokens) outperformed Gopher (280B, 300B tokens) on MMLU (67.5% vs 60%).

The Chinchilla scaling laws reshaped training decisions industry-wide, shifting labs from building the largest possible models to training smaller models on far more data. They directly influenced Llama, Mistral, and other efficient model families. By Hoffmann et al. (DeepMind); NeurIPS 2022.
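The "scale equally" finding is often summarized by the rule of thumb of roughly 20 training tokens per parameter, combined with the standard estimate that training compute is C ≈ 6·N·D FLOPs for N parameters and D tokens. A minimal sketch of that back-of-envelope calculation (the 20:1 ratio and the 6ND estimate are the common approximations, not the paper's exact fitted constants):

```python
import math

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Given a FLOP budget C, return (params N, tokens D) under the
    approximations C = 6*N*D and D = tokens_per_param * N,
    which give N = sqrt(C / (6 * tokens_per_param))."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# Chinchilla's own budget: 70B params * 1.4T tokens -> C = 6*70e9*1.4e12 FLOPs.
n, d = compute_optimal(6 * 70e9 * 1.4e12)
print(f"{n:.2e} params, {d:.2e} tokens")  # recovers ~7e10 params, ~1.4e12 tokens
```

Note that 1.4T / 70B is exactly 20 tokens per parameter, which is why the sketch recovers Chinchilla's configuration; Gopher, at roughly 1 token per parameter, sits far from this optimum.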

Paper

arXiv: 2203.15556

Venue: NeurIPS 2022

foundationalresearch

Related