Prefill-as-a-Service: Cross-Datacenter KVCache for Next-Generation Models

Proposes PrfaaS (Prefill-as-a-Service), an LLM serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the compressed KVCache over standard Ethernet to local decode clusters. This lets prefill and decode scale independently across geographically distributed datacenters, removing the tight coupling that current disaggregated serving systems require.
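To make "selectively offloads" concrete, here is a minimal sketch of one possible routing rule. It is not the paper's scheduler: the `Request` and `Link` types, the `choose_prefill_site` function, and every threshold below are hypothetical and chosen only for illustration. The rule sends a request to the remote prefill cluster only when the uncached part of the prompt is long and the resulting KVCache could cross the Ethernet link within a time budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int          # total prompt length
    cached_prefix_tokens: int   # tokens whose KVCache is already held locally (assumed known)


@dataclass
class Link:
    capacity_gbps: float        # nominal cross-datacenter bandwidth
    in_use_gbps: float          # bandwidth already consumed by other transfers


def choose_prefill_site(
    req: Request,
    link: Link,
    kv_bytes_per_token: int,
    min_offload_tokens: int = 32_000,   # hypothetical "long-context" threshold
    max_transfer_s: float = 1.0,        # hypothetical transfer-time budget
) -> str:
    """Return "remote" to offload prefill, otherwise "local"."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Short or mostly cached prompts are cheap to prefill next to the decoder.
    if uncached < min_offload_tokens:
        return "local"
    free_gbps = link.capacity_gbps - link.in_use_gbps
    if free_gbps <= 0:
        return "local"
    # Only the KVCache for the uncached tokens would have to cross the link.
    transfer_s = uncached * kv_bytes_per_token * 8 / (free_gbps * 1e9)
    return "remote" if transfer_s <= max_transfer_s else "local"


if __name__ == "__main__":
    idle = Link(capacity_gbps=100, in_use_gbps=20)
    busy = Link(capacity_gbps=100, in_use_gbps=95)
    print(choose_prefill_site(Request(128_000, 0), idle, 70_000))
    print(choose_prefill_site(Request(128_000, 0), busy, 70_000))
    print(choose_prefill_site(Request(128_000, 120_000), idle, 70_000))
```

With these made-up numbers, the long uncached prompt is offloaded on the idle link and kept local on the congested one, and the prompt with a large local prefix hit stays local regardless of link state.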

Key insight: hybrid-attention models (sliding-window plus global attention) produce much smaller KVCaches than full-attention models, small enough to fit within cross-datacenter bandwidth constraints. The system combines bandwidth-aware scheduling with cache-aware request placement and achieves 54% higher throughput than homogeneous baselines on a 1T-parameter model. The work extends Moonshot's line of KVCache-centric serving research that began with Mooncake (FAST 2025 Best Paper). By Qin, He, Wang, Li, Xu (Moonshot AI) and Wu, Zheng, Zhang (Tsinghua).
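The size argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a sliding-window layer only needs to keep keys and values for its most recent `window` tokens, while a global-attention layer keeps them for the whole sequence. The layer counts, head dimensions, window size, and 100 Gbps link are hypothetical values picked for illustration, not the configuration evaluated in the paper, and the estimate ignores any additional KVCache compression.

```python
def kv_cache_bytes(
    seq_len: int,
    n_global_layers: int,
    n_window_layers: int,
    window: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,   # fp16/bf16
) -> int:
    """Estimate KVCache size for a model mixing global and sliding-window layers."""
    # Keys and values, for every KV head, for one token in one layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    global_tokens = seq_len                 # global layers keep every token
    window_tokens = min(seq_len, window)    # window layers keep only the tail
    return per_token_per_layer * (
        n_global_layers * global_tokens + n_window_layers * window_tokens
    )


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Time to move num_bytes over a link of link_gbps, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    seq_len = 131_072  # hypothetical 128K-token prompt
    shape = dict(window=4_096, n_kv_heads=8, head_dim=128)

    # Hypothetical 64-layer model: all-global vs. 16 global + 48 sliding-window layers.
    full = kv_cache_bytes(seq_len, n_global_layers=64, n_window_layers=0, **shape)
    hybrid = kv_cache_bytes(seq_len, n_global_layers=16, n_window_layers=48, **shape)

    for name, size in (("full attention", full), ("hybrid attention", hybrid)):
        print(f"{name}: {size / 2**30:.2f} GiB, "
              f"{transfer_seconds(size, 100):.2f} s at 100 Gbps")
```

Under these assumptions the hybrid layout needs roughly 27% of the full-attention cache (about 8.75 GiB instead of 32 GiB), which brings the per-request transfer over a 100 Gbps link from several seconds to under one second.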


infrastructure, efficiency
