Full Attention Strikes Back (RTPurbo)

The paper shows that full-attention LLMs are already intrinsically sparse and can be converted into highly sparse models with only a few hundred training steps, sidestepping the usual trade-off between native sparse pretraining (expensive) and heuristic token eviction (lossy). Three observations drive the RTPurbo recipe: only a small subset of attention heads truly needs full long-context processing; long-range retrieval is governed by a low-dimensional subspace, so relevant tokens can be retrieved with a 16-dimensional indexer; and the useful token budget is query-dependent, which favors dynamic top-p selection over fixed top-k selection. RTPurbo keeps the full KV cache only for retrieval heads (~15%) and adds a lightweight token indexer for the rest.
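To make the top-p versus top-k distinction concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`project`, `top_p_select`, `top_k_select`), the softmax normalization of indexer scores, and the projection matrix are all assumptions made for illustration. The only details taken from the summary above are that tokens are scored in a low-dimensional space (16 dimensions in the paper) and that the number of kept tokens should depend on the query.

```python
import math


def project(vec, proj):
    """Map a full-width vector into the low-dimensional indexer space.

    `proj` is a hypothetical (r x d) matrix given as a list of rows;
    the paper uses r = 16, but how the matrix is obtained is not shown here.
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in proj]


def _scores(query_low, keys_low):
    """Dot-product relevance of every cached key to the query."""
    return [sum(q * k for q, k in zip(query_low, key)) for key in keys_low]


def _softmax(scores):
    """Numerically stable softmax (assumed normalization for this sketch)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(query_low, keys_low, p=0.9):
    """Keep the smallest set of tokens whose score mass reaches p."""
    probs = _softmax(_scores(query_low, keys_low))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for idx in order:
        selected.append(idx)
        mass += probs[idx]
        if mass >= p:
            break
    return sorted(selected)


def top_k_select(query_low, keys_low, k):
    """Keep a fixed number of tokens regardless of how scores are spread."""
    scores = _scores(query_low, keys_low)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    query = [1.0, 0.0]
    # One token dominates: top-p keeps very few tokens.
    peaked = [[10.0, 0.0]] + [[0.0, 0.0]] * 9
    # No token stands out: top-p keeps most of the context.
    flat = [[0.0, 0.0]] * 10
    print("top-p, peaked:", len(top_p_select(query, peaked, p=0.9)))
    print("top-p, flat:  ", len(top_p_select(query, flat, p=0.85)))
    print("top-k, peaked:", len(top_k_select(query, peaked, k=3)))
```

With a peaked score distribution the top-p budget shrinks to a handful of tokens, while with a flat distribution it grows; a fixed top-k returns the same count in both cases, which is the behavior the third observation argues against.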

RTPurbo reports near-lossless accuracy on long-context and reasoning benchmarks, with up to 9.36× prefill speedup at 1M context and ~2.01× decode speedup, demonstrated by sparsifying Qwen3-Coder-30B-A3B and Qwen3-30B-A3B-Think. It is positioned as a post-hoc alternative to natively sparse designs such as DeepSeek Sparse Attention and Kimi Delta Attention. Joint work by Alibaba Group and Nanjing University (the first author is an NJU intern at Alibaba; the project lead is Hanlin Tang, Alibaba).

Paper (arXiv)

Paper

arXiv HTML

architecture, efficiency, research
