What It Is
An attention mechanism where queries come from one sequence (the target) and keys/values come from a different sequence (the source) — allowing one modality or representation to selectively attend to information from another.
Why It Matters
Cross-attention is how conditioning works in generative models. In Latent Diffusion Models, the denoising U-Net's queries come from the noisy image latent while its keys and values come from a CLIP text embedding; this is how text prompts guide image generation. In the original Transformer, the decoder uses cross-attention over the encoder outputs during translation.
How It Works
Given queries Q from the target sequence and keys K and values V from the source sequence:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q = X_tgt·W_Q, K = X_src·W_K, and V = X_src·W_V.

This is identical to self-attention in form, but Q and K, V are projected from different inputs. In LDMs, Q comes from the noisy latent z_t at timestep t, and K, V come from the projected text embedding. Each spatial position in the image can attend to any token in the text prompt.
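The mechanism above can be sketched in a few lines. This is a minimal single-head NumPy illustration (shapes and projection names are chosen for this example, not taken from any particular library): queries are projected from the target sequence, keys and values from the source sequence, so the output has one row per target position.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_tgt, x_src, w_q, w_k, w_v):
    """Single-head cross-attention: Q from target, K/V from source."""
    q = x_tgt @ w_q            # (n_tgt, d_k)
    k = x_src @ w_k            # (n_src, d_k)
    v = x_src @ w_v            # (n_src, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # (n_tgt, n_src)
    weights = softmax(scores, axis=-1)     # each target row sums to 1
    return weights @ v                     # (n_tgt, d_k)

rng = np.random.default_rng(0)
n_tgt, n_src, d_model, d_k = 4, 6, 8, 8
x_tgt = rng.standard_normal((n_tgt, d_model))   # e.g. image-latent positions
x_src = rng.standard_normal((n_src, d_model))   # e.g. text-token embeddings
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out = cross_attention(x_tgt, x_src, w_q, w_k, w_v)
print(out.shape)  # one output vector per target position
```

Note that with `x_src = x_tgt` this reduces exactly to self-attention, which makes concrete the point that only the inputs to the projections differ.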