What It Is
An attention mechanism where queries come from one sequence (the target) and keys/values come from a different sequence (the source) — allowing one modality or representation to selectively attend to information from another.
Why It Matters
Cross-attention is how conditioning works in generative models. In Latent Diffusion Models, the denoising U-Net's queries come from the noisy image latent while its keys and values come from a CLIP text embedding; this is how text prompts guide image generation. In the original Transformer, the decoder uses cross-attention over the encoder outputs during translation.
How It Works
Given queries Q from the target sequence and keys K and values V from the source sequence:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q = X_tgt·W_Q, K = X_src·W_K, and V = X_src·W_V.

This is identical to self-attention in form, but Q and K, V are projected from different inputs. In LDMs, Q comes from the noisy latent z_t at timestep t, and K, V come from the projected text embedding. Each spatial position in the image can attend to any token in the text prompt.
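The mechanism above can be sketched in a few lines. This is a minimal single-head NumPy illustration (shapes and projection names are chosen for this example, not taken from any particular library): queries are projected from the target sequence, keys and values from the source sequence, so the output has one row per target position.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_tgt, x_src, w_q, w_k, w_v):
    """Single-head cross-attention: Q from target, K/V from source."""
    q = x_tgt @ w_q            # (n_tgt, d_k)
    k = x_src @ w_k            # (n_src, d_k)
    v = x_src @ w_v            # (n_src, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # (n_tgt, n_src)
    weights = softmax(scores, axis=-1)     # each target row sums to 1
    return weights @ v                     # (n_tgt, d_k)

rng = np.random.default_rng(0)
n_tgt, n_src, d_model, d_k = 4, 6, 8, 8
x_tgt = rng.standard_normal((n_tgt, d_model))   # e.g. image-latent positions
x_src = rng.standard_normal((n_src, d_model))   # e.g. text-token embeddings
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out = cross_attention(x_tgt, x_src, w_q, w_k, w_v)
print(out.shape)  # one output vector per target position
```

Note that with `x_src = x_tgt` this reduces exactly to self-attention, which makes concrete the point that only the inputs to the projections differ.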