What It Is

Grouped Query Attention (GQA) is an attention mechanism where multiple query heads share a single set of key and value heads, reducing the KV cache size by a factor equal to the grouping ratio $g = n_q / n_{kv}$. In standard multi-head attention (MHA), every query head has its own K/V pair. In GQA, the $n_q$ query heads are divided into groups of $g$, each group sharing one K/V head, so only $n_{kv} = n_q / g$ K/V pairs are cached instead of $n_q$.
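As a concrete illustration, here is a minimal sketch of the cached tensor shapes; the head counts and sequence length are hypothetical, chosen to match the example later in this article:

```python
# Illustrative KV-cache shapes for MHA vs. GQA (hypothetical sizes).
n_q, n_kv, d_head, seq_len = 32, 8, 128, 4096

# MHA caches one K and one V tensor per query head.
mha_kv_shape = (2, n_q, seq_len, d_head)   # (K/V, heads, tokens, dim)

# GQA caches K/V only for the n_kv shared heads.
gqa_kv_shape = (2, n_kv, seq_len, d_head)

g = n_q // n_kv  # grouping ratio: 4 query heads share each K/V head
print(g, mha_kv_shape, gqa_kv_shape)
```

The cache shrinks along the head axis only; sequence length and head dimension are unchanged.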

Why It Matters

The KV cache is the primary memory bottleneck for LLM inference. For a model with 32 query heads and head dimension 128, each generated token extends the KV cache by $2 \times 32 \times 128 \times 2 = 16{,}384$ bytes per layer (K and V, fp16). With GQA using 8 K/V heads instead of 32, that drops to 4,096 bytes, a 4× reduction. For a 32-layer model serving an 8K-token context, this saves roughly 3 GB of GPU memory that can instead serve more concurrent requests.
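The arithmetic above can be checked directly; the fp16 element size (2 bytes) and all head counts come from the example in the text:

```python
# Worked KV-cache memory arithmetic (fp16 = 2 bytes per element).
bytes_per_elem = 2
n_q, n_kv, d_head = 32, 8, 128
layers, context = 32, 8192

def kv_bytes_per_token_per_layer(n_heads):
    # K and V each store n_heads * d_head elements per token.
    return 2 * n_heads * d_head * bytes_per_elem

mha = kv_bytes_per_token_per_layer(n_q)    # full MHA cache growth
gqa = kv_bytes_per_token_per_layer(n_kv)   # GQA cache growth

saved = (mha - gqa) * layers * context     # bytes saved at full context
print(mha, gqa, saved / 2**30)
```

With these numbers the savings come out to exactly 3.0 GiB over the full 8K context.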

How It Works

For query head $i$ (0-indexed), its K/V head is determined by:

$$\mathrm{kv}(i) = \lfloor i / g \rfloor, \qquad g = n_q / n_{kv}$$
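This floor-division grouping (consecutive query heads map to the same K/V head) can be sketched in a few lines; the head counts here are hypothetical:

```python
# Query-head -> K/V-head mapping for GQA (hypothetical: 8 query heads, 2 K/V heads).
n_q, n_kv = 8, 2
g = n_q // n_kv  # 4 query heads per group

mapping = [i // g for i in range(n_q)]
print(mapping)  # heads 0-3 share K/V head 0, heads 4-7 share K/V head 1
```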

The attention computation for each head uses the shared K/V:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i\, K_{\mathrm{kv}(i)}^{\top}}{\sqrt{d_k}}\right) V_{\mathrm{kv}(i)}$$

Query heads within a group produce different outputs because their projections differ — only the source material (K, V) is shared. Empirically, the accuracy cost of this sharing is small: the diversity in attention behavior comes from the query side, not the key/value side.
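Putting the pieces together, here is a minimal NumPy sketch of a GQA forward pass. Dimensions are illustrative, and real implementations batch and fuse the per-head loop rather than iterating in Python:

```python
import numpy as np

# Minimal GQA forward pass sketch (single sequence, no batching).
rng = np.random.default_rng(0)
n_q, n_kv, d_k, seq = 8, 2, 16, 10
g = n_q // n_kv  # query heads per K/V group

Q = rng.standard_normal((n_q, seq, d_k))   # one Q per query head
K = rng.standard_normal((n_kv, seq, d_k))  # shared K, one per group
V = rng.standard_normal((n_kv, seq, d_k))  # shared V, one per group

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for i in range(n_q):
    kv = i // g                                  # shared K/V head index
    scores = Q[i] @ K[kv].T / np.sqrt(d_k)       # (seq, seq) attention logits
    heads.append(softmax(scores) @ V[kv])        # weighted sum of shared V

out = np.stack(heads)  # (n_q, seq, d_k): one distinct output per query head
print(out.shape)
```

Note that heads 0 and 1 attend over the same K/V yet produce different outputs, because their queries differ, which is the point made above.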

Multi-Query Attention (MQA) is the extreme case where $n_{kv} = 1$: all query heads share a single K/V head. GQA interpolates between MHA ($n_{kv} = n_q$) and MQA ($n_{kv} = 1$).
