GPT-4 came out in March 2023 and could accept image inputs — but nobody knew how it worked, and no open-source model could do anything similar. CLIP could align images and text into a shared embedding space. LLaMA could follow instructions. But connecting them — building a model that could see an image and answer questions about it in a conversational way — required careful glue, the right training data, and a key insight about what multimodal instruction tuning actually means. LLaVA (Large Language and Vision Assistant), published by Liu et al. in April 2023, provided all three.
The core idea
The analogy: You have two experts in different rooms. One speaks only in images — a vision encoder that can describe what it sees in a high-dimensional embedding. One speaks only in words — a language model that can reason and follow instructions. The problem: they can’t talk to each other. LLaVA builds a translator between them — a simple projection layer that converts visual embeddings into the same vocabulary the language model understands.
But the architecture is only half the innovation. The bigger contribution was the data. You can’t collect millions of humans having conversations about images and following instructions — that’s expensive and slow. Instead, Liu et al. used GPT-4 to generate the instruction-following data automatically, by feeding it image captions and bounding box descriptions (no actual images — text only) and prompting it to create:
- Conversation data: multi-turn Q&A dialogues about the image
- Detailed description data: long-form descriptions of what’s in the image
- Complex reasoning data: questions requiring multi-step reasoning about the scene
This yielded 158,000 multimodal instruction-following examples, generated cheaply from existing image-text datasets like COCO, without requiring humans to look at images at all.
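Concretely, the three data types might look like the following (a hypothetical sketch; field names and contents are illustrative, not the paper's actual schema):

```python
# Illustrative records for the three GPT-4-generated data types.
# Field names and contents are hypothetical, not the paper's actual schema.
conversation_example = {
    "type": "conversation",
    "turns": [
        {"role": "user", "content": "What is the man in the image holding?"},
        {"role": "assistant", "content": "He is holding a red umbrella."},
        {"role": "user", "content": "Is it raining?"},
        {"role": "assistant", "content": "Yes, the wet pavement suggests rain."},
    ],
}

description_example = {
    "type": "detailed_description",
    "instruction": "Describe the image in detail.",
    "response": "A man stands on a rain-soaked street holding a red umbrella...",
}

reasoning_example = {
    "type": "complex_reasoning",
    "instruction": "What challenges might the man face crossing the street?",
    "response": "Reduced visibility from the rain, and the umbrella may block his view of traffic...",
}
```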
The mechanism, step by step
Architecture:
IMAGE
|
[CLIP Vision Encoder: ViT-L/14]
|
-> 256 visual tokens (each a 1024-dim vector)
|
[Linear Projection Layer W] <- the only new parameters trained from scratch
|
-> 256 visual tokens in LLM embedding space (5120-dim for LLaMA-13B)
|
[concatenated with text tokens for the instruction/question]
|
[LLaMA Language Model]
|
Output text response
The projection layer is a single matrix multiplication: H_v = W · Z_v, where Z_v are the CLIP visual features and W maps from CLIP's visual space (dimension 1024) to LLaMA's text embedding space. That's it: no cross-attention modules, no specialized multimodal fusion layers.
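In code, the projection really is one matrix multiply followed by concatenation with the text tokens. A minimal numpy sketch (256 CLIP tokens of dimension 1024, as in the diagram; 4096 is used here only as an example LLM hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_clip, d_llm = 256, 1024, 4096  # 4096 is an example LLM hidden size

Z_v = rng.standard_normal((n_tokens, d_clip))    # CLIP features, one row per patch token
W = rng.standard_normal((d_clip, d_llm)) * 0.02  # the only newly initialized parameters

H_v = Z_v @ W                                    # visual tokens in the LLM embedding space

text_emb = rng.standard_normal((32, d_llm))      # embeddings of the instruction text tokens
llm_input = np.concatenate([H_v, text_emb], axis=0)  # [visual tokens; text tokens]

print(llm_input.shape)
```

The LLM then processes the concatenated sequence exactly as it would an all-text prompt.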
Training stages:
Stage 1 — Feature alignment (pretraining):
- Freeze CLIP and LLaMA
- Train only the projection layer
- Data: ~595K image-caption pairs filtered from CC3M
- Task: predict the caption given the image
- Goal: align visual features with LLM’s word embeddings
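The Stage 1 setup can be sketched as a toy numpy training loop. Assumptions: an MSE loss in embedding space stands in for the real caption next-token loss, random vectors stand in for CLIP features and caption-aligned embeddings, and only W receives gradient updates (CLIP and the LLM stay frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
d_clip, d_llm = 64, 128                      # shrunk dimensions for the toy example

Z_v = rng.standard_normal((256, d_clip))     # frozen CLIP features for one image
target = rng.standard_normal((256, d_llm))   # stand-in for caption-aligned embeddings
W = np.zeros((d_clip, d_llm))                # projection: the only trainable parameters

lr = 1e-3
losses = []
for step in range(200):
    H_v = Z_v @ W
    err = H_v - target                       # MSE surrogate for the alignment loss
    losses.append(float((err ** 2).mean()))
    grad_W = Z_v.T @ err / len(Z_v)          # gradient w.r.t. W only
    W -= lr * grad_W                         # no update touches CLIP or LLM weights

print(losses[0] > losses[-1])                # alignment loss decreases
```

The point of the sketch is the parameter selection, not the loss: everything except W is treated as constant data.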
Stage 2 — End-to-end visual instruction tuning:
- Freeze CLIP
- Train both the projection layer and all LLaMA weights (or use LoRA for efficiency)
- Data: the 158K GPT-4-generated instruction-following examples
- Task: follow visual instructions, answer questions about images, hold conversations
- Loss: next-token prediction on answer tokens only: L = -Σ_{t ∈ answers} log p_θ(x_t | x_{<t}), where the sum runs over answer-token positions; instruction-token positions contribute no loss
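The answer-only masking can be sketched in numpy. One assumption: instruction positions are marked with a label of -100, the ignore-index convention many training codebases use; the log-softmax here stands in for the LLM's next-token loss:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq = 32, 6

logits = rng.standard_normal((seq, vocab))        # LLM logits at each position
labels = np.array([-100, -100, -100, 7, 2, 19])   # instruction tokens masked with -100

# log-softmax over the vocabulary at each position
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

mask = labels != -100                             # True only at answer positions
answer_nll = -log_probs[mask, labels[mask]]       # per-token negative log-likelihood
loss = answer_nll.mean()                          # averaged over answer tokens only

print(int(mask.sum()))                            # number of positions in the loss
```

Only the last three positions contribute; the instruction tokens condition the prediction but are never themselves predicted.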
Data generation pipeline:
Source image annotation (COCO captions + bounding boxes)
|
→ TEXT description passed to GPT-4 (no actual image!)
→ GPT-4 asked to generate: "If you saw this image, what questions would users ask?
What would be good answers?"
→ Output: conversation/description/reasoning examples
|
Used as training data for LLaVA
The trick of using text-only GPT-4 to generate multimodal training data is elegant: GPT-4’s reasoning ability is preserved without needing multimodal API access, and the generated dialogues are higher quality than what cheaper models would produce.
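A sketch of how such a text-only prompt might be assembled from COCO-style annotations (the prompt wording, function name, and input format are illustrative, not the paper's actual templates):

```python
# Hypothetical prompt builder: converts caption + bounding-box annotations
# into a text-only GPT-4 prompt. Wording is illustrative, not the paper's.
def build_prompt(captions, boxes):
    caption_text = "\n".join(f"- {c}" for c in captions)
    box_text = "\n".join(
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, (x1, y1, x2, y2) in boxes
    )
    return (
        "You are describing an image you cannot see, using its annotations.\n"
        f"Captions:\n{caption_text}\n"
        f"Object bounding boxes (normalized xyxy):\n{box_text}\n"
        "If you saw this image, what questions would users ask? "
        "Generate a multi-turn conversation with good answers."
    )

prompt = build_prompt(
    captions=["A man holding a red umbrella on a rainy street."],
    boxes=[("person", (0.21, 0.15, 0.62, 0.98)),
           ("umbrella", (0.10, 0.05, 0.75, 0.40))],
)
print(prompt)
```

Note that the bounding boxes give GPT-4 spatial information that captions alone would omit, which is what makes the complex-reasoning examples possible.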
Find the instinct
Why does a simple linear projection work?
This is the most surprising aspect of LLaVA. Intuitively, you might expect that aligning vision and language requires a complex bridging architecture — a Perceiver Resampler (as in Flamingo), a Q-Former with learned queries (as in BLIP-2), or extensive cross-attention layers. LLaVA shows that a single linear map suffices.
The reason is CLIP. CLIP was trained to align images and text in a shared embedding space — its visual encoder already produces representations that are semantically meaningful in a language-compatible way. The visual tokens from CLIP-ViT already “speak a language” that’s close to what LLMs understand. The projection doesn’t need to do heavy semantic lifting; it just needs to rescale and rotate the space slightly.
“We find that a simple linear layer can effectively connect visual features with language models, enabling visual understanding capabilities through instruction tuning.”
The insight generalizes: LLMs trained on enough text have already developed rich conceptual representations. Visual features from strong vision encoders are already semantically structured. The gap between them is smaller than it looks — you mostly need alignment, not translation.
Why generate instruction data with text-only GPT-4?
The prior approach to multimodal training was: gather image-text pairs (image + caption) and train on next-token prediction. This produces models that can describe images but not follow instructions; asked a question about an image, they tend to emit a generic caption rather than a conversational answer.
LLaVA’s insight: the bottleneck isn’t visual understanding, it’s instruction-following in the visual domain. GPT-4 already knows how to reason about image descriptions in text — so you can mine its reasoning to generate training data that teaches the model to behave correctly when given visual instructions.
Results
On ScienceQA (a multimodal science question benchmark with ~21K questions):
- GPT-4 (text-only, with captions): 82.69%
- LLaVA fine-tuned: 90.92%
- LLaVA + GPT-4 ensemble: 92.53% (new SOTA at publication time)
On LLaVA-Bench (a custom multimodal instruction benchmark, scored by GPT-4 relative to GPT-4’s own answers):
- BLIP-2: 38.1% relative score
- OpenFlamingo: 28.7%
- LLaVA: 85.1% relative to GPT-4
The gap is large. The combination of strong CLIP features, LLaMA’s reasoning, a simple projection, and the GPT-4-generated instruction data produced qualitatively different capabilities.
Behaviors the paper highlights:
- Describing detailed content in memes and cartoons
- Identifying unusual elements in images
- Multi-step reasoning about spatial relationships
- Following complex visual instructions
What doesn’t work:
- Struggles with fine-grained object counting
- Can hallucinate details not present in the image
- Low input resolution from CLIP ViT-L/14 (fine-grained, high-resolution details are lost)
- The projection layer alignment can fail when CLIP features don’t capture relevant details
Practical implications
LLaVA established the recipe that almost all open-source VLMs now follow:
- Start with a strong frozen vision encoder (CLIP, SigLIP, or similar)
- Start with a strong frozen LLM (LLaMA, Mistral, Qwen, etc.)
- Add a simple projection/connector (linear, MLP, or small cross-attention)
- Two-stage training: align features first, then instruction-tune end-to-end
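The recipe above reduces to a choice of three modules plus a per-stage decision about which are trainable. A schematic summary (module names and descriptions are illustrative):

```python
# Schematic of the now-standard VLM recipe; names are illustrative.
RECIPE = {
    "modules": {
        "vision_encoder": "CLIP ViT-L/14 or similar (frozen throughout)",
        "connector": "linear projection, MLP, or small cross-attention",
        "llm": "LLaMA-class decoder",
    },
    "stage1_feature_alignment": {
        "trainable": ["connector"],
        "data": "image-caption pairs",
    },
    "stage2_instruction_tuning": {
        "trainable": ["connector", "llm"],  # or LoRA adapters on the LLM
        "data": "GPT-4-generated visual instruction examples",
    },
}

for stage in ("stage1_feature_alignment", "stage2_instruction_tuning"):
    print(stage, "->", RECIPE[stage]["trainable"])
```

The vision encoder never appears in a trainable set: the whole bet is that its features are already good enough and only the bridge (and later the LLM) needs adapting.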
The successors — LLaVA-1.5, LLaVA-NeXT, InstructBLIP, MiniGPT-4, Qwen-VL, and many others — all trace back to this architecture blueprint. Qwen2.5-VL, which pairs a Vision Transformer with the Qwen2.5 LLM through a lightweight connector, is a direct descendant of LLaVA's philosophy.
Connections
- multimodal-instruction-tuning — the technique introduced in this paper
- contrastive-learning — CLIP (the vision encoder) is trained via contrastive learning
- sft — Stage 2 of LLaVA training is visual instruction tuning, a form of SFT
- in-context-learning — GPT-4 generates the training data from in-context examples of captions and bounding boxes
- lora-low-rank-adaptation — LoRA is commonly used to fine-tune the LLM component in LLaVA variants efficiently
- clip-learning-transferable-visual-models — provides the frozen vision encoder
- attention-is-all-you-need — the underlying architecture for both the ViT encoder and the LLM
Citation
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. NeurIPS 2023 (Oral). https://arxiv.org/abs/2304.08485