Introduces Group Expectation Policy Optimization (GEPO), an asynchronous RL algorithm robust to latency in geographically distributed computing networks. GEPO uses group expectation weighting to reduce variance in importance weights, enabling stable training of large models across heterogeneous nodes.
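
The variance-reduction idea can be illustrated with a toy sketch. The summary above does not give GEPO's exact estimator, so the form used here is an assumption made purely for illustration: the per-sample denominator q_i is replaced by one value shared across the group, the plain mean of the sampling probabilities. The function names and the log-probability values are hypothetical; the paper defines the actual weighting.

```python
import math
import statistics


def standard_weights(new_logp, old_logp):
    """Classic per-sample importance weights p_i / q_i."""
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]


def group_expectation_weights(new_logp, old_logp):
    """Weights p_i / mean_j(q_j).

    ASSUMPTION: the shared denominator is a simple group mean of the
    sampling probabilities. This is an illustrative stand-in, not the
    estimator from the paper.
    """
    q = [math.exp(o) for o in old_logp]
    denom = sum(q) / len(q)  # one denominator for the whole group
    return [math.exp(n) / denom for n in new_logp]


# Hypothetical sequence log-probabilities for one group of four responses.
# The third response was very unlikely under the stale sampling policy,
# which is the situation that high latency makes more common.
old_logp = [-1.0, -2.0, -6.0, -1.5]  # log q_i (stale sampler)
new_logp = [-1.2, -1.8, -3.0, -1.4]  # log p_i (current learner)

std = standard_weights(new_logp, old_logp)
grp = group_expectation_weights(new_logp, old_logp)

print("per-sample weight variance:", statistics.pvariance(std))
print("group-denominator variance:", statistics.pvariance(grp))
```

In this toy group a single tiny q_i inflates one per-sample weight, while the shared denominator keeps all four weights on a similar scale. The sketch only shows the direction of the effect and says nothing about the bias and variance trade-off analyzed in the paper.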

Paper

arXiv: 2508.17850

training, infrastructure, research

Related