What It Is

Grouped Query Attention (GQA) is an attention mechanism where multiple query heads share a single set of key and value heads, reducing the KV cache size by a factor equal to the grouping ratio $g = n_q / n_{kv}$. In standard multi-head attention (MHA), every query head has its own K/V pair. In GQA, the $n_q$ query heads are divided into groups of $g$, each group sharing one K/V head, so only $n_{kv} = n_q / g$ K/V pairs are cached instead of $n_q$.
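As a concrete illustration, here is a minimal sketch of the cached tensor shapes; the head counts and sequence length are hypothetical, chosen to match the example later in this article:

```python
# Illustrative KV-cache shapes for MHA vs. GQA (hypothetical sizes).
n_q, n_kv, d_head, seq_len = 32, 8, 128, 4096

# MHA caches one K and one V tensor per query head.
mha_kv_shape = (2, n_q, seq_len, d_head)   # (K/V, heads, tokens, dim)

# GQA caches K/V only for the n_kv shared heads.
gqa_kv_shape = (2, n_kv, seq_len, d_head)

g = n_q // n_kv  # grouping ratio: 4 query heads share each K/V head
print(g, mha_kv_shape, gqa_kv_shape)
```

The cache shrinks along the head axis only; sequence length and head dimension are unchanged.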

Why It Matters

The KV cache is the primary memory bottleneck for LLM inference. For a model with 32 query heads and head dimension 128, each generated token extends the KV cache by $2 \times 32 \times 128 \times 2 = 16{,}384$ bytes per layer (K and V, fp16). With GQA using 8 K/V heads instead of 32, that drops to 4,096 bytes, a 4× reduction. For a 32-layer model serving an 8K-token context, this saves roughly 3 GB of GPU memory that can instead serve more concurrent requests.
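The arithmetic above can be checked directly; the fp16 element size (2 bytes) and all head counts come from the example in the text:

```python
# Worked KV-cache memory arithmetic (fp16 = 2 bytes per element).
bytes_per_elem = 2
n_q, n_kv, d_head = 32, 8, 128
layers, context = 32, 8192

def kv_bytes_per_token_per_layer(n_heads):
    # K and V each store n_heads * d_head elements per token.
    return 2 * n_heads * d_head * bytes_per_elem

mha = kv_bytes_per_token_per_layer(n_q)    # full MHA cache growth
gqa = kv_bytes_per_token_per_layer(n_kv)   # GQA cache growth

saved = (mha - gqa) * layers * context     # bytes saved at full context
print(mha, gqa, saved / 2**30)
```

With these numbers the savings come out to exactly 3.0 GiB over the full 8K context.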

How It Works

For query head $i$ (0-indexed), its K/V head is determined by:

$$\mathrm{kv}(i) = \lfloor i / g \rfloor, \qquad g = n_q / n_{kv}$$
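This floor-division grouping (consecutive query heads map to the same K/V head) can be sketched in a few lines; the head counts here are hypothetical:

```python
# Query-head -> K/V-head mapping for GQA (hypothetical: 8 query heads, 2 K/V heads).
n_q, n_kv = 8, 2
g = n_q // n_kv  # 4 query heads per group

mapping = [i // g for i in range(n_q)]
print(mapping)  # heads 0-3 share K/V head 0, heads 4-7 share K/V head 1
```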

The attention computation for each head uses the shared K/V:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i\, K_{\mathrm{kv}(i)}^{\top}}{\sqrt{d_k}}\right) V_{\mathrm{kv}(i)}$$

Query heads within a group produce different outputs because their projections differ — only the source material (K, V) is shared. Empirically, the accuracy cost of this sharing is small: the diversity in attention behavior comes from the query side, not the key/value side.
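Putting the pieces together, here is a minimal NumPy sketch of a GQA forward pass. Dimensions are illustrative, and real implementations batch and fuse the per-head loop rather than iterating in Python:

```python
import numpy as np

# Minimal GQA forward pass sketch (single sequence, no batching).
rng = np.random.default_rng(0)
n_q, n_kv, d_k, seq = 8, 2, 16, 10
g = n_q // n_kv  # query heads per K/V group

Q = rng.standard_normal((n_q, seq, d_k))   # one Q per query head
K = rng.standard_normal((n_kv, seq, d_k))  # shared K, one per group
V = rng.standard_normal((n_kv, seq, d_k))  # shared V, one per group

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for i in range(n_q):
    kv = i // g                                  # shared K/V head index
    scores = Q[i] @ K[kv].T / np.sqrt(d_k)       # (seq, seq) attention logits
    heads.append(softmax(scores) @ V[kv])        # weighted sum of shared V

out = np.stack(heads)  # (n_q, seq, d_k): one distinct output per query head
print(out.shape)
```

Note that heads 0 and 1 attend over the same K/V yet produce different outputs, because their queries differ, which is the point made above.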

Multi-Query Attention (MQA) is the extreme case where $n_{kv} = 1$: all query heads share a single K/V head. GQA interpolates between MHA ($n_{kv} = n_q$) and MQA ($n_{kv} = 1$).
