Mixture-of-Depths Attention (MoDA)
A novel attention mechanism in which each attention head can attend both to sequence KV pairs at the current layer and to depth KV pairs from preceding layers, addressing signal degradation in deep Transformers. Introduces a hardware-efficient algorithm that reaches 97.3% of FlashAttention-2's efficiency at 64K sequence length by resolving the memory access pattern challenges inherent in cross-depth attention (a minimal sketch of the idea appears after this entry).
At 1.5B scale: a 0.2 perplexity reduction across benchmarks and a 2.11% average gain on downstream tasks, with only 3.7% additional compute overhead. MoDA works better with post-norm than with pre-norm architectures. Joint work with Huazhong University of Science & Technology (HUST). By Zhu, Fang, Liao, Wang, Cheng, Huang et al. (ByteDance Seed + HUST).
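Below is a minimal sketch of the cross-depth attention pattern described above, not the paper's implementation: the tensor names (`k_depth`, `v_depth`), shapes, and the single joint softmax over sequence and depth targets are assumptions for illustration, and the causal mask and the paper's hardware-efficient memory layout are omitted.

```python
# Sketch: each query attends over the usual sequence KV at the current layer
# plus per-position KV pairs produced at preceding layers (cross-depth attention).
import torch
import torch.nn.functional as F

def moda_attention_sketch(q, k_seq, v_seq, k_depth, v_depth):
    """
    q:       (batch, heads, seq, d)          queries at the current layer
    k_seq:   (batch, heads, seq, d)          sequence keys at the current layer
    v_seq:   (batch, heads, seq, d)          sequence values at the current layer
    k_depth: (batch, heads, seq, layers, d)  per-position keys from preceding layers (assumed layout)
    v_depth: (batch, heads, seq, layers, d)  per-position values from preceding layers (assumed layout)
    """
    d = q.size(-1)
    scale = d ** -0.5

    # Scores over the sequence axis (causal mask omitted for brevity).
    seq_scores = torch.einsum("bhqd,bhkd->bhqk", q, k_seq) * scale

    # Scores over the depth axis: each position attends to its own
    # representations from earlier layers.
    depth_scores = torch.einsum("bhqd,bhqld->bhql", q, k_depth) * scale

    # One softmax over the union of sequence and depth targets (an assumption;
    # the paper may normalize differently).
    scores = torch.cat([seq_scores, depth_scores], dim=-1)
    probs = F.softmax(scores, dim=-1)

    p_seq, p_depth = probs.split([k_seq.size(2), k_depth.size(3)], dim=-1)
    out = torch.einsum("bhqk,bhkd->bhqd", p_seq, v_seq)
    out = out + torch.einsum("bhql,bhqld->bhqd", p_depth, v_depth)
    return out
```

The sketch materializes full score matrices for clarity; the hardware-efficiency claim in the paper concerns avoiding exactly this kind of irregular, cross-layer memory access in a fused kernel.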