A training-infrastructure system for multi-task agentic reinforcement learning on disaggregated hardware. Agentic RL workloads mix compute-bound prefill, bandwidth-bound decode, CPU-heavy environment execution, and bursty reward evaluation; RollArt maps each pipeline stage to best-fit hardware — prefill-heavy tasks to compute-optimized GPUs, decode-heavy to bandwidth-optimized GPUs, and environments to CPU clusters — instead of colocating everything or decoupling coarsely.

It decouples rollout at the trajectory level so generation, environment interaction, and reward scoring proceed independently (a slow or failed environment never blocks the others), offloads stateless reward computation to serverless infrastructure, and overlaps rollout with training via staleness-bounded asynchronous weight sync. Reports 1.31–2.05× training-time reduction over prior RL systems, and was validated training a hundreds-of-billions-parameter MoE for Alibaba's Qoder coding product on a >3,000-GPU cluster. Built on Alibaba's open ROLL framework. By Alibaba Group / Tongyi Lab with HKUST.

Paper

traininginfrastructurereinforcement-learningagenticresearch