Training-free KV cache compression built on the observation that most attention heads only attend locally while a few retrieval heads attend across the full input. RazorAttention keeps a full KV cache only for the retrieval heads and drops remote tokens in all other heads, adding a "compensation token" that recovers information from the discarded tokens — unlike token-eviction methods, no token's information is irreversibly erased. Cuts KV cache size by over 70% with no noticeable performance impact, is FlashAttention-compatible, and plugs in without retraining the model. ICLR 2025.

An early articulation of the retrieval-heads view of attention sparsity that later transfer-to-sparse work builds on — Alibaba/NJU's RTPurbo (2026) cites it for the head-level sparsity observation. All seven authors at Huawei Technologies.

Paper

Venue ICLR 2025
efficiencyresearch

Related