GPT-4 came out in March 2023 and could accept image inputs — but nobody knew how it worked, and no open-source model could do anything similar. CLIP could align images and text into a shared embedding space. LLaMA could follow instructions. But connecting them — building a model that could see an image and answer questions about it in a conversational way — required careful glue, the right training data, and a key insight about what multimodal instruction tuning actually means. LLaVA (Large Language and Vision Assistant), published by Liu et al. in April 2023, provided all three.
The core idea
The analogy: You have two experts in different rooms. One speaks only in images — a vision encoder that can describe what it sees in a high-dimensional embedding. One speaks only in words — a language model that can reason and follow instructions. The problem: they can’t talk to each other. LLaVA builds a translator between them — a simple projection layer that converts visual embeddings into the same vocabulary the language model understands.
But the architecture is only half the innovation. The bigger contribution was the data. You can’t collect millions of humans having conversations about images and following instructions — that’s expensive and slow. Instead, Liu et al. used GPT-4 to generate the instruction-following data automatically, by feeding it image captions and bounding box descriptions (no actual images — text only) and prompting it to create:
- Conversation data: multi-turn Q&A dialogues about the image
- Detailed description data: long-form descriptions of what’s in the image
- Complex reasoning data: questions requiring multi-step reasoning about the scene
This yielded 158,000 multimodal instruction-following examples, generated cheaply from existing image-text datasets like COCO, without requiring humans to look at images at all.
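Concretely, the three data types might look like the following (a hypothetical sketch; field names and contents are illustrative, not the paper's actual schema):

```python
# Illustrative records for the three GPT-4-generated data types.
# Field names and contents are hypothetical, not the paper's actual schema.
conversation_example = {
    "type": "conversation",
    "turns": [
        {"role": "user", "content": "What is the man in the image holding?"},
        {"role": "assistant", "content": "He is holding a red umbrella."},
        {"role": "user", "content": "Is it raining?"},
        {"role": "assistant", "content": "Yes, the wet pavement suggests rain."},
    ],
}

description_example = {
    "type": "detailed_description",
    "instruction": "Describe the image in detail.",
    "response": "A man stands on a rain-soaked street holding a red umbrella...",
}

reasoning_example = {
    "type": "complex_reasoning",
    "instruction": "What challenges might the man face crossing the street?",
    "response": "Reduced visibility from the rain, and the umbrella may block his view of traffic...",
}
```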
The mechanism, step by step
Architecture:
IMAGE
|
[CLIP Vision Encoder: ViT-L/14]
|
-> 256 visual tokens (each a 1024-dim vector)
|
[Linear Projection Layer W] <- the only new parameters trained from scratch
|
-> 256 visual tokens in LLM embedding space (5120-dim for LLaMA-13B)
|
[concatenated with text tokens for the instruction/question]
|
[LLaMA Language Model]
|
Output text response
The projection layer is a single matrix multiplication: H_v = W · Z_v, where Z_v are the CLIP visual features and W maps from CLIP's visual space (dimension 1024) to LLaMA's text embedding space. That's it: no cross-attention modules, no specialized multimodal fusion layers.
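In code, the projection really is one matrix multiply followed by concatenation with the text tokens. A minimal numpy sketch (256 CLIP tokens of dimension 1024, as in the diagram; 4096 is used here only as an example LLM hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_clip, d_llm = 256, 1024, 4096  # 4096 is an example LLM hidden size

Z_v = rng.standard_normal((n_tokens, d_clip))    # CLIP features, one row per patch token
W = rng.standard_normal((d_clip, d_llm)) * 0.02  # the only newly initialized parameters

H_v = Z_v @ W                                    # visual tokens in the LLM embedding space

text_emb = rng.standard_normal((32, d_llm))      # embeddings of the instruction text tokens
llm_input = np.concatenate([H_v, text_emb], axis=0)  # [visual tokens; text tokens]

print(llm_input.shape)
```

The LLM then processes the concatenated sequence exactly as it would an all-text prompt.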
Training stages:
Stage 1 — Feature alignment (pretraining):
- Freeze CLIP and LLaMA
- Train only the projection layer
- Data: ~595K image-caption pairs filtered from CC3M
- Task: predict the caption given the image
- Goal: align visual features with LLM’s word embeddings
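The Stage 1 setup can be sketched as a toy numpy training loop. Assumptions: an MSE loss in embedding space stands in for the real caption next-token loss, random vectors stand in for CLIP features and caption-aligned embeddings, and only W receives gradient updates (CLIP and the LLM stay frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
d_clip, d_llm = 64, 128                      # shrunk dimensions for the toy example

Z_v = rng.standard_normal((256, d_clip))     # frozen CLIP features for one image
target = rng.standard_normal((256, d_llm))   # stand-in for caption-aligned embeddings
W = np.zeros((d_clip, d_llm))                # projection: the only trainable parameters

lr = 1e-3
losses = []
for step in range(200):
    H_v = Z_v @ W
    err = H_v - target                       # MSE surrogate for the alignment loss
    losses.append(float((err ** 2).mean()))
    grad_W = Z_v.T @ err / len(Z_v)          # gradient w.r.t. W only
    W -= lr * grad_W                         # no update touches CLIP or LLM weights

print(losses[0] > losses[-1])                # alignment loss decreases
```

The point of the sketch is the parameter selection, not the loss: everything except W is treated as constant data.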
Stage 2 — End-to-end visual instruction tuning:
- Freeze CLIP
- Train both the projection layer and all LLaMA weights (or use LoRA for efficiency)
- Data: the 158K GPT-4-generated instruction-following examples
- Task: follow visual instructions, answer questions about images, hold conversations
- Loss: next-token prediction on answer tokens only: L = -Σ_{t ∈ answers} log p_θ(x_t | x_{<t}), where the sum runs over answer-token positions; instruction-token positions contribute no loss
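The answer-only masking can be sketched in numpy. One assumption: instruction positions are marked with a label of -100, the ignore-index convention many training codebases use; the log-softmax here stands in for the LLM's next-token loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq = 32, 6

logits = rng.standard_normal((seq, vocab))        # LLM logits at each position
labels = np.array([-100, -100, -100, 7, 2, 19])   # instruction tokens masked with -100

# log-softmax over the vocabulary at each position
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

mask = labels != -100                             # True only at answer positions
answer_nll = -log_probs[mask, labels[mask]]       # per-token negative log-likelihood
loss = answer_nll.mean()                          # averaged over answer tokens only

print(int(mask.sum()))                            # number of positions in the loss
```

Only the last three positions contribute; the instruction tokens condition the prediction but are never themselves predicted.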
Data generation pipeline:
Source image annotation (COCO captions + bounding boxes)
|
→ TEXT description passed to GPT-4 (no actual image!)
→ GPT-4 asked to generate: "If you saw this image, what questions would users ask?
What would be good answers?"
→ Output: conversation/description/reasoning examples
|
Used as training data for LLaVA
The trick of using text-only GPT-4 to generate multimodal training data is elegant: GPT-4’s reasoning ability is preserved without needing multimodal API access, and the generated dialogues are higher quality than what cheaper models would produce.
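A sketch of how such a text-only prompt might be assembled from COCO-style annotations (the prompt wording, function name, and input format are illustrative, not the paper's actual templates):

```python
# Hypothetical prompt builder: converts caption + bounding-box annotations
# into a text-only GPT-4 prompt. Wording is illustrative, not the paper's.
def build_prompt(captions, boxes):
    caption_text = "\n".join(f"- {c}" for c in captions)
    box_text = "\n".join(
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return (
        "You are describing an image you cannot see, using its annotations.\n"
        f"Captions:\n{caption_text}\n"
        f"Object bounding boxes (normalized xyxy):\n{box_text}\n"
        "If you saw this image, what questions would users ask? "
        "Generate a multi-turn conversation with good answers."
    )

prompt = build_prompt(
    captions=["A man holding a red umbrella on a rainy street."],
    boxes=[("person", (0.21, 0.15, 0.62, 0.98)),
           ("umbrella", (0.10, 0.05, 0.75, 0.40))],
)
print(prompt)
```

Note that the bounding boxes give GPT-4 spatial information that captions alone would omit, which is what makes the complex-reasoning examples possible.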
Find the instinct
Why does a simple linear projection work?
This is the most surprising aspect of LLaVA. Intuitively, you might expect that aligning vision and language requires a complex bridging architecture — a Perceiver Resampler (as in Flamingo), a Q-Former with learned queries (as in BLIP-2), or extensive cross-attention layers. LLaVA shows that a single linear map suffices.
The reason is CLIP. CLIP was trained to align images and text in a shared embedding space — its visual encoder already produces representations that are semantically meaningful in a language-compatible way. The visual tokens from CLIP-ViT already “speak a language” that’s close to what LLMs understand. The projection doesn’t need to do heavy semantic lifting; it just needs to rescale and rotate the space slightly.
“We find that a simple linear layer can effectively connect visual features with language models, enabling visual understanding capabilities through instruction tuning.”
The insight generalizes: LLMs trained on enough text have already developed rich conceptual representations. Visual features from strong vision encoders are already semantically structured. The gap between them is smaller than it looks — you mostly need alignment, not translation.
Why generate instruction data with text-only GPT-4?
The prior approach to multimodal training was: gather image-text pairs (image + caption) and train on next-token prediction. This produces models that can describe images but not follow instructions; asked a question about an image, they tend to emit a generic caption rather than a conversational answer.
LLaVA’s insight: the bottleneck isn’t visual understanding, it’s instruction-following in the visual domain. GPT-4 already knows how to reason about image descriptions in text — so you can mine its reasoning to generate training data that teaches the model to behave correctly when given visual instructions.
Results
On ScienceQA (a multimodal science question benchmark with ~21K questions):
- GPT-4 (text-only, with captions): 82.69%
- LLaVA fine-tuned: 90.92%
- LLaVA + GPT-4 ensemble: 92.53% (new SOTA at publication time)
On LLaVA-Bench (a custom multimodal instruction benchmark, scored by GPT-4 relative to GPT-4’s own answers):
- BLIP-2: 38.1% relative score
- OpenFlamingo: 28.7%
- LLaVA: 85.1% relative to GPT-4
The gap is large. The combination of strong CLIP features, LLaMA’s reasoning, a simple projection, and the GPT-4-generated instruction data produced qualitatively different capabilities.
Behaviors the paper highlights:
- Describing detailed content in memes and cartoons
- Identifying unusual elements in images
- Multi-step reasoning about spatial relationships
- Following complex visual instructions
What doesn’t work:
- Struggles with fine-grained object counting
- Can hallucinate details not present in the image
- Low input resolution from CLIP ViT-L/14 (fine-grained, high-resolution details are lost)
- The projection layer alignment can fail when CLIP features don’t capture relevant details
Practical implications
LLaVA established the recipe that almost all open-source VLMs now follow:
- Start with a strong frozen vision encoder (CLIP, SigLIP, or similar)
- Start with a strong frozen LLM (LLaMA, Mistral, Qwen, etc.)
- Add a simple projection/connector (linear, MLP, or small cross-attention)
- Two-stage training: align features first, then instruction-tune end-to-end
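The recipe above reduces to a choice of three modules plus a per-stage decision about which are trainable. A schematic summary (module names and descriptions are illustrative):

```python
# Schematic of the now-standard VLM recipe; names are illustrative.
RECIPE = {
    "modules": {
        "vision_encoder": "CLIP ViT-L/14 or similar (frozen throughout)",
        "connector": "linear projection, MLP, or small cross-attention",
        "llm": "LLaMA-class decoder",
    },
    "stage1_feature_alignment": {
        "trainable": ["connector"],
        "data": "image-caption pairs",
    },
    "stage2_instruction_tuning": {
        "trainable": ["connector", "llm"],  # or LoRA adapters on the LLM
        "data": "GPT-4-generated visual instruction examples",
    },
}

for stage in ("stage1_feature_alignment", "stage2_instruction_tuning"):
    print(stage, "->", RECIPE[stage]["trainable"])
```

The vision encoder never appears in a trainable set: the whole bet is that its features are already good enough and only the bridge (and later the LLM) needs adapting.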
The successors — LLaVA-1.5, LLaVA-NeXT, InstructBLIP, MiniGPT-4, Qwen-VL, and many others — all trace back to this architecture blueprint. Qwen2.5-VL, which pairs a Vision Transformer with the Qwen2.5 LLM through a lightweight connector, is a direct descendant of LLaVA's philosophy.
Connections
- multimodal-instruction-tuning — the technique introduced in this paper
- contrastive-learning — CLIP (the vision encoder) is trained via contrastive learning
- sft — Stage 2 of LLaVA training is visual instruction tuning, a form of SFT
- in-context-learning — GPT-4 generates the training data from in-context examples of captions and bounding boxes
- lora-low-rank-adaptation — LoRA is commonly used to fine-tune the LLM component in LLaVA variants efficiently
- clip-learning-transferable-visual-models — provides the frozen vision encoder
- attention-is-all-you-need — the underlying architecture for both the ViT encoder and the LLM
Citation
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. NeurIPS 2023 (Oral). https://arxiv.org/abs/2304.08485