Proposes Parallel Loop Transformer (PLT), which removes the sequential bottleneck of looped Transformers by computing different loops for different tokens simultaneously within a single forward pass. Introduces three key techniques: Cross-Loop Parallelism (CLP) to execute loops in parallel, KV-cache sharing that reuses the first loop's cache in later loops to prevent memory growth, and Gated Sliding-Window Attention (G-SWA) to balance global and local context.
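
A minimal sketch of how the G-SWA idea could look, assuming a gated mix of two attention branches: a global branch over the KV cache shared from the first loop and a local sliding-window branch over the current loop's own KV. All names, shapes, and the per-head sigmoid gate are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of Gated Sliding-Window Attention (G-SWA) for a later loop.
# Assumption: later loops attend globally to the shared first-loop KV cache and
# locally to their own sliding-window KV, mixed by a learned per-head sigmoid gate.
import torch


def g_swa(q, k_global, v_global, k_local, v_local, gate_logit, window=128):
    """q and all KV tensors: (B, H, T, D); gate_logit: (H,)."""
    B, H, T, D = q.shape
    scale = D ** -0.5

    # Global branch: causal attention over the KV cache shared from the first loop.
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    attn_g = (q @ k_global.transpose(-2, -1)) * scale
    attn_g = attn_g.masked_fill(~causal, float("-inf")).softmax(dim=-1)
    out_global = attn_g @ v_global

    # Local branch: causal attention restricted to a sliding window over this loop's KV.
    idx = torch.arange(T, device=q.device)
    local_mask = causal & ((idx[:, None] - idx[None, :]) < window)
    attn_l = (q @ k_local.transpose(-2, -1)) * scale
    attn_l = attn_l.masked_fill(~local_mask, float("-inf")).softmax(dim=-1)
    out_local = attn_l @ v_local

    # Learned per-head gate balances global and local context.
    g = torch.sigmoid(gate_logit).view(1, H, 1, 1)
    return g * out_global + (1 - g) * out_local


B, H, T, D = 2, 8, 256, 64
q = torch.randn(B, H, T, D)
k_g, v_g = torch.randn(B, H, T, D), torch.randn(B, H, T, D)  # shared first-loop KV
k_l, v_l = torch.randn(B, H, T, D), torch.randn(B, H, T, D)  # this loop's windowed KV
out = g_swa(q, k_g, v_g, k_l, v_l, gate_logit=torch.zeros(H))
print(out.shape)  # torch.Size([2, 8, 256, 64])
```

Because only the first loop writes a full-length cache, later loops add just the small sliding-window KV, which is consistent with the reported ~1.4% cache overhead.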

Validated on ByteDance's internal Seed-MoE models (680M/13B and 2.5B/60B) and in open-source dense and MoE settings. PLT-2 on a 1.7B/40B MoE reaches 77.3 MMLU and 80.5 CEval, matching a 2.5B/60B baseline at ~30% lower latency, and cuts inference latency by 47% versus a vanilla looped Transformer with only 1.4% KV-cache overhead. Enables efficient test-time compute scaling without the parameter or latency cost of deeper models.

Paper

efficiency architecture