Scaling Laws for Neural Language Models
Established power-law scaling relationships between language model performance and three key factors: model size (N), dataset size (D), and compute budget (C). Showed that test loss improves predictably as a smooth power law in each factor, and that for a fixed compute budget, model size matters more than dataset size.
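The model-size law takes the form L(N) = (N_c / N)^{α_N}. A minimal sketch of that relationship, using the approximate constants reported in the paper (α_N ≈ 0.076, N_c ≈ 8.8e13 non-embedding parameters); the helper name is illustrative, not from the paper:

```python
# Sketch of the paper's power law for loss vs. model size:
#   L(N) = (N_c / N) ** alpha_N
# Constants are the approximate fitted values from the paper.
ALPHA_N = 0.076     # exponent for model size
N_C = 8.8e13        # characteristic scale (non-embedding parameters)

def loss_from_params(n_params: float) -> float:
    """Predicted test loss (nats/token) for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

# Loss falls by a constant factor each time model size is multiplied by 10.
for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

Because the exponent is small, each 10x increase in parameters buys only a modest constant reduction in loss, which is why the predicted gains stay smooth rather than saturating abruptly.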
This paper became the intellectual foundation for the "scaling hypothesis" that drove billions in compute investment across the industry. Its conclusions — that larger models are more sample-efficient and that optimal allocation favors model size over data — directly shaped GPT-3 and GPT-4 training decisions. Later refined by DeepMind's Chinchilla scaling laws. By Kaplan, McCandlish, Henighan, Brown, Chess, Child, Gray, Radford, Wu, and Amodei.
Paper
arXiv: 2001.08361