ScaleAcross Explorer

"ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training." Systems paper from Meta's SysML research org tackling multi-datacenter training — distributing a single LLM run across "a few data centers housing hundreds of thousands of GPUs," each with its own intra/inter-DC network characteristics.

Co-optimizes three design axes — parallelism placement, parallelism scheduling, and network layer technologies — via a single optimizer that considers their interactions instead of tuning each in isolation. Reports up to 64.62% training speedup over Meta's production configuration and 37.59% over the strongest published baseline in testbed experiments and simulations. Drawn from Meta's production experience training Llama at multi-DC scale.

28 pages, 27 figures. By Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, and Carole-Jean Wu (Meta + Harvard).

Paper (arXiv) · Paper (HTML)



infrastructure · training · foundational · research
