IndexCache is the sparse-attention efficiency technique behind GLM-5.2, where it is productized as IndexShare. DeepSeek Sparse Attention (DSA) uses a lightweight "lightning indexer" to select the top-k relevant tokens per query, reducing core attention from O(L²) to O(Lk). The indexer itself, however, stays O(L²) and reruns at every layer, even though its top-k selections are highly similar across consecutive layers.
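To make the top-k selection step concrete, here is a minimal pure-Python sketch for a single query. It assumes the indexer's relevance scores are already given as a plain list (`index_scores` is a hypothetical stand-in; the real lightning indexer computes these scores itself), and the function names are illustrative rather than taken from the paper.

```python
import math
from typing import List, Tuple


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending position order."""
    k = min(k, len(scores))
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


def sparse_attention_row(
    q: List[float],
    keys: List[List[float]],
    values: List[List[float]],
    index_scores: List[float],  # assumed: one indexer score per key position
    k: int,
) -> Tuple[List[int], List[float]]:
    """Attend one query to only the k keys that the indexer scored highest."""
    keep = top_k_indices(index_scores, k)
    scale = math.sqrt(len(q))
    # Core attention touches k keys instead of all L keys.
    logits = [sum(a * b for a, b in zip(q, keys[i])) / scale for i in keep]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, keep):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return keep, out


# Toy usage: four key positions, keep the two with the highest indexer scores.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
kept, output = sparse_attention_row([1.0, 0.0], keys, values, [0.1, 0.9, 0.3, 0.7], k=2)
```

Note that scoring still covers every key position for every query, which is why the indexer cost remains quadratic in L even though the core attention cost drops to O(Lk).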

IndexCache exploits that cross-layer redundancy by partitioning layers into a few Full layers that run their own indexer and a majority of Shared layers that reuse the nearest Full layer's top-k indices. In GLM-5.2 this reuses one indexer across every four sparse-attention layers, cutting per-token FLOPs 2.9× at 1M-token context with negligible quality loss. By Tsinghua University and Z.ai.
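The Full/Shared partition can be sketched as a simple per-layer schedule. This is an illustration under stated assumptions, not the paper's implementation: it places a Full layer at a fixed interval and lets each Shared layer reuse the most recent preceding Full layer's indices, whereas the actual placement of Full layers may be chosen differently. `build_layer_roles`, `select_indices_per_layer`, and `toy_indexer` are hypothetical names.

```python
from typing import Callable, List


def build_layer_roles(num_layers: int, share_every: int = 4) -> List[str]:
    """Mark every `share_every`-th layer (starting at layer 0) as Full, the rest as Shared."""
    return ["full" if i % share_every == 0 else "shared" for i in range(num_layers)]


def select_indices_per_layer(
    roles: List[str],
    run_indexer: Callable[[int], List[int]],
) -> List[List[int]]:
    """Run the indexer only on Full layers; Shared layers reuse the cached indices."""
    cached = None
    per_layer = []
    for layer, role in enumerate(roles):
        if role == "full" or cached is None:
            cached = run_indexer(layer)  # the expensive indexer pass
        per_layer.append(cached)  # Shared layers skip the indexer entirely
    return per_layer


# Toy usage: record which layers actually invoke the indexer.
indexer_calls: List[int] = []


def toy_indexer(layer: int) -> List[int]:
    indexer_calls.append(layer)
    return [layer, layer + 1]  # placeholder top-k positions, not real selections


roles = build_layer_roles(8, share_every=4)
indices = select_indices_per_layer(roles, toy_indexer)
```

With eight layers and an interval of four, the toy indexer runs on layers 0 and 4 only, and the six Shared layers reuse those two results.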

Paper

Authors: Yushi Bai · Qian Dong · Ting Jiang · Xin Lv · Zhengxiao Du · Aohan Zeng · Jie Tang · Juanzi Li
architecture · efficiency · research
