Predictable Scale Part I: Step Law
Empirical study across 3,700+ LLMs, trained on a combined 100T tokens, that establishes scaling laws for optimal hyperparameters. It finds that the optimal learning rate follows a power law in both model size and dataset size, while the optimal batch size depends mainly on dataset size.
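The functional form described above can be sketched in a few lines of Python. This is only an illustration of the shape of the relationship, not the paper's fitted law: the function name `optimal_hparams`, the dictionary keys, and every coefficient and exponent in the usage example are hypothetical placeholders chosen for easy arithmetic. The fitted values should be taken from the paper itself.

```python
def optimal_hparams(n_params, n_tokens, lr_fit, bs_fit):
    """Evaluate power-law estimates of learning rate and batch size.

    n_params : model size N (number of parameters)
    n_tokens : dataset size D (number of training tokens)
    lr_fit   : dict with 'coef', 'exp_n', 'exp_d' for lr = coef * N**exp_n * D**exp_d
    bs_fit   : dict with 'coef', 'exp_d' for bs = coef * D**exp_d
    All fit values are supplied by the caller; none are hard-coded here.
    """
    # Learning rate: power law in both model size and dataset size.
    lr = lr_fit["coef"] * n_params ** lr_fit["exp_n"] * n_tokens ** lr_fit["exp_d"]
    # Batch size: power law in dataset size only.
    bs = bs_fit["coef"] * n_tokens ** bs_fit["exp_d"]
    return lr, bs


# PLACEHOLDER fit values, NOT taken from the paper.
example_lr_fit = {"coef": 1.0, "exp_n": -0.5, "exp_d": 0.25}
example_bs_fit = {"coef": 1.0, "exp_d": 0.5}

lr, bs = optimal_hparams(10_000, 10_000, example_lr_fit, example_bs_fit)
print(f"learning rate ~ {lr:.4g}, batch size ~ {bs:.4g}")
```

Passing the fit as arguments keeps the sketch independent of any particular set of constants, so the same function can be reused once the published coefficients are substituted in.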
Paper
arXiv: 2503.04715