What It Is

Early fusion is a multimodal architecture strategy in which different modalities (e.g., image patches and text tokens) are combined into a single shared sequence and processed together from the first layer of one backbone. This contrasts with late fusion, where separate encoders process each modality independently before their outputs are combined.
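The structural contrast can be shown in a minimal numpy sketch. Everything here is illustrative, not from any specific model: `backbone` is a hypothetical stand-in for a Transformer stack (identity, to keep the sketch runnable), and the token counts and width are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding width (illustrative)

# Hypothetical already-embedded inputs: 4 image-patch tokens, 3 text tokens.
image_tokens = rng.standard_normal((4, D))
text_tokens = rng.standard_normal((3, D))

def backbone(x):
    # Stand-in for a Transformer stack; identity keeps the sketch runnable.
    return x

# Early fusion: one shared sequence enters a single backbone from layer 1.
fused = backbone(np.concatenate([image_tokens, text_tokens], axis=0))

# Late fusion (for contrast): each modality is encoded separately,
# and the outputs are only combined afterwards.
combined = np.concatenate(
    [backbone(image_tokens), backbone(text_tokens)], axis=0
)
```

The difference is where the concatenation happens relative to the backbone: before it (early fusion, so attention can mix modalities at every layer) versus after it (late fusion, where cross-modal interaction only begins at the combination step).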

Why It Matters

Late-fusion pipelines (frozen vision encoder + separate text decoder) accumulate complexity, make it hard to attribute improvements, and create scaling bottlenecks. Early fusion allows the model to learn cross-modal representations from the ground up, enabling tighter integration of visual and linguistic reasoning. This is especially beneficial for tasks requiring compositional understanding — spatial reasoning, relational grounding, and OCR-guided disambiguation — where late fusion struggles.

How It Works

In early-fusion Transformers, image patches (usually from a patch embedding layer) and tokenized text are concatenated into one sequence before the first attention layer. A hybrid attention mask handles the structural difference between modalities:

  • Image tokens attend bidirectionally to all other image tokens (like a vision encoder)
  • Text/task tokens attend causally to the full visual prefix plus preceding text tokens

This preserves the global visual context benefits of bidirectional attention while still supporting autoregressive generation for task outputs.
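The hybrid mask described above can be sketched as a boolean matrix, assuming the common layout of a visual prefix followed by text tokens (function name and layout are illustrative, not from the source):

```python
import numpy as np

def hybrid_attention_mask(num_image: int, num_text: int) -> np.ndarray:
    """Boolean mask where entry [i, j] = True means query token i
    may attend to key token j.

    Assumed sequence layout: [image tokens | text tokens].
    """
    n = num_image + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Image queries: bidirectional attention within the image block.
    mask[:num_image, :num_image] = True
    # Text queries: full access to the visual prefix...
    mask[num_image:, :num_image] = True
    # ...plus causal attention over preceding (and own) text tokens.
    for t in range(num_text):
        i = num_image + t
        mask[i, num_image : i + 1] = True
    return mask
```

For example, with 3 image tokens and 2 text tokens, image token 0 can see image token 2 (and vice versa) but not any text token, while text token 1 sees the whole visual prefix and text token 0 but nothing after itself. In practice this mask would be applied as an additive `-inf` bias on attention logits rather than a boolean gate.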

Open Questions

  • Does early fusion require significantly more compute than late fusion at equivalent model scale?
  • How does early fusion handle very long image sequences (e.g., high-resolution, video)?
  • Is the hybrid attention mask optimal, or would full bidirectional or full causal attention work as well?