What It Is

Early fusion is a multimodal architecture strategy in which different modalities (e.g., image patches and text tokens) are combined into a single shared sequence and processed together from the first layer of one backbone. This contrasts with late fusion, where separate encoders process each modality independently before their outputs are combined.
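The structural contrast can be shown in a minimal numpy sketch. Everything here is illustrative, not from any specific model: `backbone` is a hypothetical stand-in for a Transformer stack (identity, to keep the sketch runnable), and the token counts and width are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding width (illustrative)

# Hypothetical already-embedded inputs: 4 image-patch tokens, 3 text tokens.
image_tokens = rng.standard_normal((4, D))
text_tokens = rng.standard_normal((3, D))

def backbone(x):
    # Stand-in for a Transformer stack; identity keeps the sketch runnable.
    return x

# Early fusion: one shared sequence enters a single backbone from layer 1.
fused = backbone(np.concatenate([image_tokens, text_tokens], axis=0))

# Late fusion (for contrast): each modality is encoded separately,
# and the outputs are only combined afterwards.
combined = np.concatenate(
    [backbone(image_tokens), backbone(text_tokens)], axis=0
)
```

The difference is where the concatenation happens relative to the backbone: before it (early fusion, so attention can mix modalities at every layer) versus after it (late fusion, where cross-modal interaction only begins at the combination step).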

Why It Matters

Late-fusion pipelines (frozen vision encoder + separate text decoder) accumulate complexity, make it hard to attribute improvements, and create scaling bottlenecks. Early fusion allows the model to learn cross-modal representations from the ground up, enabling tighter integration of visual and linguistic reasoning. This is especially beneficial for tasks requiring compositional understanding — spatial reasoning, relational grounding, and OCR-guided disambiguation — where late fusion struggles.

How It Works

In early-fusion Transformers, image patches (usually from a patch embedding layer) and tokenized text are concatenated into one sequence before the first attention layer. A hybrid attention mask handles the structural difference between modalities:

  • Image tokens attend bidirectionally to all other image tokens (like a vision encoder)
  • Text/task tokens attend causally to the full visual prefix plus preceding text tokens

This preserves the global visual context benefits of bidirectional attention while still supporting autoregressive generation for task outputs.
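The hybrid mask described above can be sketched as a boolean matrix, assuming the common layout of a visual prefix followed by text tokens (function name and layout are illustrative, not from the source):

```python
import numpy as np

def hybrid_attention_mask(num_image: int, num_text: int) -> np.ndarray:
    """Boolean mask where entry [i, j] = True means query token i
    may attend to key token j.

    Assumed sequence layout: [image tokens | text tokens].
    """
    n = num_image + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Image queries: bidirectional attention within the image block.
    mask[:num_image, :num_image] = True
    # Text queries: full access to the visual prefix...
    mask[num_image:, :num_image] = True
    # ...plus causal attention over preceding (and own) text tokens.
    for t in range(num_text):
        i = num_image + t
        mask[i, num_image : i + 1] = True
    return mask
```

For example, with 3 image tokens and 2 text tokens, image token 0 can see image token 2 (and vice versa) but not any text token, while text token 1 sees the whole visual prefix and text token 0 but nothing after itself. In practice this mask would be applied as an additive `-inf` bias on attention logits rather than a boolean gate.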

Open Questions

  • Does early fusion require significantly more compute than late fusion at equivalent model scale?
  • How does early fusion handle very long image sequences (e.g., high-resolution, video)?
  • Is the hybrid attention mask optimal, or would full bidirectional or full causal attention work as well?