Grouped Query Attention (GQA)
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, with a recipe for uptraining existing multi-head checkpoints using only 5% of the original pre-training compute.
GQA was rapidly adopted by Llama 2, Mistral, and most subsequent frontier models as the default attention mechanism, offering near multi-head quality at multi-query speed. By Ainslie, Lee-Thorp, de Jong, et al.
arXiv: 2305.13245
Venue: EMNLP 2023
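To make the grouping concrete, here is a minimal NumPy sketch of the GQA mechanism, in which each group of query heads shares a single key/value head. The `gqa` function, head counts, and dimensions are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of grouped-query attention (GQA) in NumPy.
# Hypothetical shapes and helper; not the paper's reference code.
import numpy as np

def gqa(q, k, v, num_q_heads, num_kv_heads):
    """q: (seq, num_q_heads, d); k, v: (seq, num_kv_heads, d).
    Each group of num_q_heads // num_kv_heads query heads shares one KV head."""
    seq, _, d = q.shape
    group = num_q_heads // num_kv_heads
    # Repeat each KV head so every query head in a group attends to the same K/V.
    k_rep = np.repeat(k, group, axis=1)   # (seq, num_q_heads, d)
    v_rep = np.repeat(v, group, axis=1)
    # Scaled dot-product attention per query head.
    scores = np.einsum('qhd,khd->hqk', q, k_rep) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', weights, v_rep)  # (seq, num_q_heads, d)

# Example: 8 query heads sharing 2 KV heads (group size 4).
# num_kv_heads = 1 recovers multi-query attention; num_kv_heads = 8 recovers multi-head.
seq, d = 16, 32
q = np.random.randn(seq, 8, d)
k = np.random.randn(seq, 2, d)
v = np.random.randn(seq, 2, d)
print(gqa(q, k, v, num_q_heads=8, num_kv_heads=2).shape)  # (16, 8, 32)
```

The practical benefit is that the KV cache shrinks by the ratio of query heads to KV heads, since only `num_kv_heads` key/value projections need to be stored per token.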