Exclusive Self Attention
Identifies a previously unreported "attention similarity bias": in standard self-attention, the output vector y_i has very high cosine similarity with the token's own value vector v_i, suggesting that self-attention spends substantial capacity modeling a pointwise feature transform rather than mixing context.
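One way to observe the reported bias, as a minimal sketch (the tensor names, shapes, and the per-head layout are assumptions, not the paper's code): compute the cosine similarity between each attention output y_i and the same token's value vector v_i.

```python
import torch
import torch.nn.functional as F

def self_value_cosine(y: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """y: attention outputs, v: value vectors, both (batch, seq_len, head_dim).
    Returns cos(y_i, v_i) per token; values near 1 would indicate the bias described above."""
    return F.cosine_similarity(y, v, dim=-1)
```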
Proposes Exclusive Self Attention (XSA), a two-line modification that projects out the self-value component: z_i = y_i − ((y_iᵀ v_i) / ‖v_i‖²) v_i. Constrains attention to capture only information orthogonal to the token's own value, so the residual stream (not attention) handles the self position.
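A minimal sketch of this projection step, assuming PyTorch tensors y (per-head attention output) and v (the token's own value vector) of shape (batch, seq_len, head_dim); the function name, shapes, and the eps clamp are illustrative assumptions rather than the paper's implementation.

```python
import torch

def xsa_project(y: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Remove the component of the attention output along the token's own value:
        z_i = y_i - ((y_i . v_i) / ||v_i||^2) v_i
    y, v: (batch, seq_len, head_dim). Returns z with z_i orthogonal to v_i.
    """
    dot = (y * v).sum(dim=-1, keepdim=True)       # y_i . v_i
    sq_norm = (v * v).sum(dim=-1, keepdim=True)   # ||v_i||^2
    return y - dot / sq_norm.clamp_min(eps) * v   # eps clamp for numerical safety (assumption)
```

Presumably this would slot in per head, applied to the attention output before the output projection, which is consistent with the paper's description of XSA as a two-line change.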
Tested at 0.7B / 1.4B / 2.7B on FineWeb-100BT (100B tokens, 2048 context, AdamW + cosine schedule, 200K iterations). XSA consistently outperforms standard self-attention across all model sizes, with the largest gains at 2.7B (+1.36 avg across 8 downstream tasks: ARC-Easy, BoolQ, HellaSwag, LAMBADA, OpenBookQA, PIQA, SocialIQA, WinoGrande) and gains that grow further as sequence length is extended to 16K. Minimal compute and memory overhead. By Shuangfei Zhai (Apple).