Awarded **Best Paper at NeurIPS 2025**, this research introduces a simple but foundational architectural modification for transformers: a head-specific sigmoid gate applied to the output of Scaled Dot-Product Attention (SDPA). The gate introduces non-linearity, enables query-dependent sparse gating, and effectively eliminates "attention sinks." The mechanism significantly improves training stability, allows for higher learning rates, and improves long-context extrapolation (up to 1M+ tokens). It is a core innovation in the **Qwen3** and **Qwen3.5** series, often paired with linear attention variants in hybrid configurations.
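
As a rough illustration, the sketch below shows one way such gating can be realized in PyTorch: a learned gate projection of the current token's hidden state is passed through a sigmoid and multiplied elementwise into the SDPA output, per head, before the output projection. The module and projection names (`GatedAttention`, `gate_proj`) and the elementwise gating granularity are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate after SDPA.

    Hypothetical sketch: the gate is conditioned on the current token's
    hidden state and applied elementwise to each head's attention output.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        # Gate projection: one value per head dimension, conditioned on the
        # query-side hidden state (query-dependent gating).
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape

        def split_heads(p: torch.Tensor) -> torch.Tensor:
            # (b, t, d_model) -> (b, n_heads, t, d_head)
            return p.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Standard causal scaled dot-product attention.
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Head-specific sigmoid gate applied immediately after SDPA: values
        # near zero let a token suppress a head's output entirely, giving
        # non-linear, sparse mixing without an "attention sink" token.
        gate = torch.sigmoid(split_heads(self.gate_proj(x)))
        attn_out = attn_out * gate

        # Merge heads and project out.
        attn_out = attn_out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(attn_out)
```

Because the gate is computed from the query-side hidden state, each token can independently down-weight a head's output, which is what allows query-dependent sparsity instead of forcing excess attention mass onto a sink token.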

Paper

arXiv: 2505.06708

architecture · efficiency · scaling · research

Related