Exclusive Self Attention
Identifies a previously unreported "attention similarity bias": in standard self-attention, the output vector y_i has very high cosine similarity with the token's own value vector v_i, suggesting that self-attention spends substantial capacity modeling a pointwise feature transform rather than mixing context.
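One way to observe the reported bias, as a minimal sketch (the tensor names, shapes, and the per-head layout are assumptions, not the paper's code): compute the cosine similarity between each attention output y_i and the same token's value vector v_i.

```python
import torch
import torch.nn.functional as F

def self_value_cosine(y: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """y: attention outputs, v: value vectors, both (batch, seq_len, head_dim).
    Returns cos(y_i, v_i) per token; values near 1 would indicate the bias described above."""
    return F.cosine_similarity(y, v, dim=-1)
```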
Proposes Exclusive Self Attention (XSA), a two-line modification that projects out the self-value component: z_i = y_i − ((y_iᵀ v_i) / ‖v_i‖²) v_i. Constrains attention to capture only information orthogonal to the token's own value, so the residual stream (not attention) handles the self position.
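A minimal sketch of this projection step, assuming PyTorch tensors y (per-head attention output) and v (the token's own value vector) of shape (batch, seq_len, head_dim); the function name, shapes, and the eps clamp are illustrative assumptions rather than the paper's implementation.

```python
import torch

def xsa_project(y: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Remove the component of the attention output along the token's own value:
        z_i = y_i - ((y_i . v_i) / ||v_i||^2) v_i
    y, v: (batch, seq_len, head_dim). Returns z with z_i orthogonal to v_i.
    """
    dot = (y * v).sum(dim=-1, keepdim=True)       # y_i . v_i
    sq_norm = (v * v).sum(dim=-1, keepdim=True)   # ||v_i||^2
    return y - dot / sq_norm.clamp_min(eps) * v   # eps clamp for numerical safety (assumption)
```

Presumably this would slot in per head, applied to the attention output before the output projection, which is consistent with the paper's description of XSA as a two-line change.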
Tested at 0.7B / 1.4B / 2.7B on FineWeb-100BT (100B tokens, 2048 context, AdamW + cosine schedule, 200K iterations). XSA consistently outperforms standard self-attention across all model sizes, with the largest gains at 2.7B (+1.36 avg across 8 downstream tasks: ARC-Easy, BoolQ, HellaSwag, LAMBADA, OpenBookQA, PIQA, SocialIQA, WinoGrande) and gains that grow further as sequence length is extended to 16K. Minimal compute and memory overhead. By Shuangfei Zhai (Apple).