Grouped Query Attention (GQA)
Introduced grouped-query attention as a middle ground between multi-head and multi-query attention, with a recipe for uptraining existing multi-head checkpoints using only 5% of the original pre-training compute.
GQA was rapidly adopted by Llama 2, Mistral, and most subsequent frontier models as the default attention mechanism, offering near multi-head quality at multi-query speed. By Ainslie, Lee-Thorp, de Jong, et al.
arXiv: 2305.13245
Venue: EMNLP 2023
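To make the grouping concrete, here is a minimal NumPy sketch of the GQA mechanism, in which each group of query heads shares a single key/value head. The `gqa` function, head counts, and dimensions are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of grouped-query attention (GQA) in NumPy.
# Hypothetical shapes and helper; not the paper's reference code.
import numpy as np

def gqa(q, k, v, num_q_heads, num_kv_heads):
    """q: (seq, num_q_heads, d); k, v: (seq, num_kv_heads, d).
    Each group of num_q_heads // num_kv_heads query heads shares one KV head."""
    seq, _, d = q.shape
    group = num_q_heads // num_kv_heads
    # Repeat each KV head so every query head in a group attends to the same K/V.
    k_rep = np.repeat(k, group, axis=1)   # (seq, num_q_heads, d)
    v_rep = np.repeat(v, group, axis=1)
    # Scaled dot-product attention per query head.
    scores = np.einsum('qhd,khd->hqk', q, k_rep) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', weights, v_rep)  # (seq, num_q_heads, d)

# Example: 8 query heads sharing 2 KV heads (group size 4).
# num_kv_heads = 1 recovers multi-query attention; num_kv_heads = 8 recovers multi-head.
seq, d = 16, 32
q = np.random.randn(seq, 8, d)
k = np.random.randn(seq, 2, d)
v = np.random.randn(seq, 2, d)
print(gqa(q, k, v, num_q_heads=8, num_kv_heads=2).shape)  # (16, 8, 32)
```

The practical benefit is that the KV cache shrinks by the ratio of query heads to KV heads, since only `num_kv_heads` key/value projections need to be stored per token.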