Proposes PrfaaS (Prefill-as-a-Service), an LLM serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the compressed KVCache over standard Ethernet to local decode clusters. Enables independent scaling of prefill and decode across geographically distributed datacenters, removing the tight coupling that current disaggregated serving systems require.

Key insight: hybrid-attention models (sliding window + global attention) produce much smaller KVCaches than full-attention models, small enough to fit within cross-datacenter bandwidth constraints. The system combines bandwidth-aware scheduling with cache-aware request placement. Achieves 54% higher throughput vs. homogeneous baselines on a 1T-parameter model. Extends Moonshot's line of KVCache-centric serving research that began with Mooncake (FAST 2025 Best Paper). By Qin, He, Wang, Li, Xu (Moonshot AI) and Wu, Zheng, Zhang (Tsinghua).
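To make the KVCache-size argument concrete, here is a back-of-the-envelope sketch in Python. It is not taken from the paper: the model shape, the 4096-token window, the one-global-layer-in-four ratio, and the 100 Gbps link are all hypothetical values chosen for illustration, and the helper names are made up. It assumes sliding-window layers keep only the last `window` tokens, while global layers keep every token.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical model dimensions (not the paper's model)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes a 16-bit KV cache (fp16/bf16)


def per_token_layer_bytes(m: ModelShape) -> int:
    # One key and one value vector per KV head, per token, per layer.
    return 2 * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes_full(m: ModelShape, seq_len: int) -> int:
    # Full attention: every layer caches every token.
    return per_token_layer_bytes(m) * m.n_layers * seq_len


def kv_bytes_hybrid(m: ModelShape, seq_len: int, window: int, global_every: int) -> int:
    # Assumption: one in every `global_every` layers is global;
    # the remaining layers cache only the last `window` tokens.
    n_global = m.n_layers // global_every
    n_local = m.n_layers - n_global
    cached_tokens = n_global * seq_len + n_local * min(seq_len, window)
    return per_token_layer_bytes(m) * cached_tokens


def transfer_seconds(n_bytes: int, link_gbps: float) -> float:
    # Idealized transfer time: ignores protocol overhead and link contention.
    return n_bytes * 8 / (link_gbps * 1e9)


# Hypothetical example: 64 layers, 8 KV heads of dim 128, 128K-token prompt.
shape = ModelShape(n_layers=64, n_kv_heads=8, head_dim=128)
seq_len = 131072
full = kv_bytes_full(shape, seq_len)
hybrid = kv_bytes_hybrid(shape, seq_len, window=4096, global_every=4)

for name, size in (("full", full), ("hybrid", hybrid)):
    print(f"{name}: {size / 2**30:.2f} GiB, "
          f"{transfer_seconds(size, link_gbps=100):.2f} s at 100 Gbps")
```

Under these made-up numbers the full-attention cache is 32 GiB and the hybrid cache is 8.75 GiB, so the idealized transfer over a 100 Gbps link drops from roughly 2.75 s to 0.75 s. The actual ratio depends on the real model's layer mix, window size, and any additional KVCache compression, none of which are specified here.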


infrastructure, efficiency

Related