What It Is

An attention mechanism where queries come from one sequence (the target) and keys/values come from a different sequence (the source) — allowing one modality or representation to selectively attend to information from another.

Why It Matters

Cross-attention is how conditioning works in generative models. In Latent Diffusion Models, the denoising U-Net queries the noisy image latent while attending to a CLIP text embedding — this is how text prompts guide image generation. In the original Transformer, the decoder uses cross-attention to attend to encoder outputs during translation.

How It Works

Given queries from the target sequence $X_{\text{tgt}}$ and keys/values from the source sequence $X_{\text{src}}$:

$$\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = X_{\text{tgt}}W_Q,\quad K = X_{\text{src}}W_K,\quad V = X_{\text{src}}W_V$$

This is identical to self-attention in form, but $Q$ and $K, V$ are projected from different inputs. In LDMs, $Q = W_Q\,\varphi(z_t)$ (the noisy latent at timestep $t$) and $K = W_K\,\tau_\theta(y)$, $V = W_V\,\tau_\theta(y)$ (the projected text embedding). Each spatial position in the image can attend to any token in the text prompt.
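The formula above can be sketched as a single-head cross-attention layer in NumPy. This is a minimal illustration, not a production implementation: the function name, the tiny dimensions, and the random inputs standing in for image-latent positions and text tokens are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_tgt, x_src, w_q, w_k, w_v):
    """Queries come from the target sequence; keys/values from the source."""
    q = x_tgt @ w_q                            # (n_tgt, d_k)
    k = x_src @ w_k                            # (n_src, d_k)
    v = x_src @ w_v                            # (n_src, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (n_tgt, n_src)
    weights = softmax(scores, axis=-1)         # each target row sums to 1
    return weights @ v                         # (n_tgt, d_v)

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 4
x_tgt = rng.normal(size=(5, d_model))   # e.g. 5 spatial positions of a noisy latent
x_src = rng.normal(size=(3, d_model))   # e.g. 3 text-embedding tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))

out = cross_attention(x_tgt, x_src, w_q, w_k, w_v)
print(out.shape)  # (5, 4): one d_v-dim vector per target position
```

Note that the output has one row per *target* position while the attention weights span the *source* tokens, which is exactly how each latent position mixes in information from every prompt token.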

Key Sources