What It Is

Multimodal instruction tuning is the process of fine-tuning a language model that has been connected to a vision encoder to follow natural language instructions about images. The result is a Vision-Language Model (VLM): a system that can see images, answer questions about them, describe them, reason about visual content, and hold multi-turn conversations grounded in visual context. LLaVA (Visual Instruction Tuning, Liu et al. 2023) established the dominant recipe that most open-source VLMs now follow.

Why It Matters

Before visual instruction tuning, multimodal models could classify images or generate captions, but could not follow complex natural language instructions about visual content. GPT-4's vision capability, demonstrated at its March 2023 launch (GPT-4V did not become broadly available until September 2023), proved the capability existed, but the model was closed-source. LLaVA showed the open-source community how to replicate it cheaply: a small dataset of GPT-4-generated training data and a simple linear projection connecting CLIP to LLaMA. This triggered an explosion of open VLMs.

The Architecture Blueprint

The dominant VLM architecture (established by LLaVA, followed by LLaVA-1.5, InstructBLIP, Qwen-VL, and many others):

IMAGE
  |
[Vision Encoder] — typically a CLIP ViT, frozen during training
  |
  → N visual tokens (patch embeddings in vision space)
  |
[Connector/Projection] — maps visual to language embedding space
  Options:
  - Linear projection (LLaVA-1.0): simplest, surprisingly effective
  - MLP (LLaVA-1.5): small 2-layer MLP, better
  - Q-Former (BLIP-2): cross-attention with learned queries, more parameters
  - Perceiver Resampler (Flamingo, Idefics): learned-query cross-attention, similar in spirit to the Q-Former
  |
  → N visual tokens in LLM embedding space
  |
[Concatenate with text instruction tokens]
  |
[Language Model (LLaMA, Qwen, Mistral, etc.)] — attends over the combined visual + text sequence
  |
Output text (description, answer, etc.)

The key insight from LLaVA: CLIP features are already semantically close to what LLMs understand. CLIP was trained to align images with text descriptions at scale, so its visual features carry linguistic structure; the LLM, in turn, has read enough text about the visual world that its embedding space has room for visual concepts. The projection does not need to be complex: it only needs to map (roughly, rotate and rescale) the visual embedding space into the LLM's text space.
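The diagram above can be sketched in a few lines. This is a minimal numpy stand-in, not a real model: the dimensions assume CLIP ViT-L/14 at 336px (24×24 = 576 patches, 1024-d features) and a LLaMA-7B-style 4096-d embedding space, and the random tensors stand in for actual encoder and embedding-table outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: CLIP ViT-L/14 @ 336px yields 576 patch tokens of
# width 1024; a LLaMA-7B token embedding is 4096-d.
N_PATCHES, D_VISION, D_LLM = 576, 1024, 4096

def project_visual_tokens(patch_embeds, W, b):
    """LLaVA-1.0-style connector: a single linear layer into LLM space."""
    return patch_embeds @ W + b

# Frozen vision encoder output for one image (stand-in values).
patch_embeds = rng.standard_normal((N_PATCHES, D_VISION)).astype(np.float32)

# The connector's parameters — the only new weights the architecture adds.
W = (rng.standard_normal((D_VISION, D_LLM)) * 0.02).astype(np.float32)
b = np.zeros(D_LLM, dtype=np.float32)

visual_tokens = project_visual_tokens(patch_embeds, W, b)   # (576, 4096)

# Text instruction, already embedded by the LLM's embedding table (stand-in).
text_tokens = rng.standard_normal((32, D_LLM)).astype(np.float32)

# The LLM consumes visual and text tokens as one ordinary sequence.
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)
print(sequence.shape)  # (608, 4096)
```

From the LLM's perspective there is nothing special about the visual tokens: they are just 576 extra embeddings prepended to the prompt.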

Training Stages

The standard two-stage training pipeline:

Stage 1 — Feature alignment (frozen LLM, frozen vision encoder, train only connector):

  • Data: image-caption pairs (LLaVA uses ~595K CC3M filtered pairs)
  • Task: predict the caption from the image
  • Goal: align the visual feature space with the LLM’s token embeddings
  • LLM is frozen; the connector alone learns to present images in a form the LLM can read

Stage 2 — Instruction tuning (frozen vision encoder, train connector + LLM):

  • Data: instruction-following examples (Q&A, conversations, detailed descriptions, complex reasoning)
  • LLM weights are fine-tuned (or LoRA-tuned for efficiency)
  • Goal: teach the model to follow visual instructions, not just describe images
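The two stages differ only in which parameter groups receive gradients. A hedged PyTorch sketch of the freezing pattern, with tiny `nn.Linear` stand-ins for the real modules (an actual implementation would wrap pretrained CLIP and LLM checkpoints, but the `requires_grad` logic is the same):

```python
import torch
from torch import nn

# Stand-in modules; real VLMs load pretrained weights here.
vision_encoder = nn.Linear(1024, 1024)   # stand-in for a CLIP ViT
connector = nn.Linear(1024, 4096)        # LLaVA-1.5 would use a 2-layer MLP
llm = nn.Linear(4096, 4096)              # stand-in for the language model

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1 — feature alignment: only the connector learns.
set_trainable(vision_encoder, False)
set_trainable(llm, False)
set_trainable(connector, True)
stage1_params = [p for p in connector.parameters() if p.requires_grad]

# Stage 2 — instruction tuning: unfreeze the LLM; vision stays frozen.
set_trainable(llm, True)
trainable = sum(p.numel()
                for m in (vision_encoder, connector, llm)
                for p in m.parameters() if p.requires_grad)
```

In Stage 2 the optimizer would be rebuilt over the newly trainable parameters; LoRA variants instead keep the LLM frozen and train low-rank adapters in its place.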

LLaVA’s key innovation: generating Stage 2 data cheaply via GPT-4. Given image captions and bounding box annotations (text only — no actual images sent to GPT-4), GPT-4 generates multi-turn Q&A, detailed descriptions, and complex reasoning questions. 158K examples generated this way produced qualitatively new capabilities.

Data Pipeline: Generating Multimodal Instruction Data

The bottleneck for VLM training is instruction-following data with rich visual grounding. Collecting this from humans is expensive. LLaVA’s solution:

  1. Take existing image annotations (COCO captions + bounding box descriptions)
  2. Feed the text annotations to GPT-4 (no actual image needed)
  3. Prompt GPT-4 to generate: “What questions would users ask about this image? What are the ideal answers?”
  4. GPT-4 produces: conversation data, long-form description data, complex reasoning data

The insight: GPT-4’s reasoning capability is preserved when reasoning about text descriptions of images. The generated dialogues are higher quality than what cheaper models produce. And the entire pipeline requires no human annotators looking at images.
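The pipeline reduces to prompt construction over existing annotations. A sketch of that step, with the caveat that the wording and annotation format below are illustrative, not LLaVA's actual prompts (those are in the paper's appendix):

```python
# Illustrative sketch of LLaVA-style text-only data generation.
# Field names, prompt wording, and the example annotation are hypothetical.

def build_annotation_context(captions, boxes):
    """Render an image as text (no pixels) for the teacher LLM."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (category, normalized box x1 y1 x2 y2):")
    lines += [f"- {name}: {x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f}"
              for name, (x1, y1, x2, y2) in boxes]
    return "\n".join(lines)

INSTRUCTION = (
    "You are looking at an image described below. Generate a multi-turn "
    "conversation: questions a user might ask about the image, each with "
    "a detailed, grounded answer. Only ask about content you can verify "
    "from the annotations."
)

context = build_annotation_context(
    captions=["A man rides a horse on a beach at sunset."],
    boxes=[("person", (0.31, 0.12, 0.55, 0.60)),
           ("horse", (0.25, 0.35, 0.70, 0.92))],
)
prompt = INSTRUCTION + "\n\n" + context
# `prompt` would be sent to GPT-4 (text-only); its responses, paired with
# the actual image, become Stage 2 training examples.
```

The bounding boxes matter: they let the teacher model answer spatial questions ("what is to the left of the horse?") that captions alone cannot support.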

Why a Simple Linear Projection Suffices

This is the most counterintuitive finding from LLaVA. You might expect that bridging vision and language requires complex cross-attention (Flamingo), or carefully designed query transformers (BLIP-2). LLaVA shows a single linear layer works nearly as well.

Why: CLIP’s vision encoder was trained to produce image representations aligned with text descriptions of those images. The visual tokens are already “thinking in language-adjacent terms.” The LLM has seen enough text about the visual world that its embedding space already has structure for visual concepts. The gap between them is smaller than intuition suggests — a linear projection closes most of it.
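A toy illustration of the claim, under a strong simplifying assumption: if LLM-space embeddings really were (approximately) a fixed linear transform of CLIP-space embeddings plus noise, then ordinary least squares would recover the bridge — which is what Stage 1 training finds by gradient descent. The data here is synthetic; real embedding spaces are only approximately linearly related.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: assume "LLM-space" features are a fixed linear map of
# "CLIP-space" features plus small noise. Dimensions are arbitrary.
d_vis, d_llm, n = 64, 128, 2000
A_true = rng.standard_normal((d_vis, d_llm)) / np.sqrt(d_vis)

clip_feats = rng.standard_normal((n, d_vis))
llm_feats = clip_feats @ A_true + 0.01 * rng.standard_normal((n, d_llm))

# Fit the linear projection in closed form (least squares).
W, *_ = np.linalg.lstsq(clip_feats, llm_feats, rcond=None)

# Held-out check: the single linear layer explains almost all the variance.
test_clip = rng.standard_normal((200, d_vis))
test_llm = test_clip @ A_true
err = np.linalg.norm(test_clip @ W - test_llm) / np.linalg.norm(test_llm)
print(f"relative error: {err:.4f}")
```

To the extent the real CLIP-to-LLM gap is close to linear, a single trained layer closes most of it; the residual nonlinear part is what LLaVA-1.5's 2-layer MLP picks up.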

VLM Capabilities and Failure Modes

Capabilities enabled by visual instruction tuning:

  • Detailed image description
  • Visual Q&A (natural language questions about image content)
  • Multi-turn visual conversations
  • OCR and document understanding (with high-res variants)
  • Diagram and chart interpretation
  • Meme and scene understanding

Persistent failure modes:

  • Object counting: reliably counting more than 3-4 items in a scene remains difficult
  • Hallucination: generating plausible-sounding descriptions of things not in the image
  • Resolution limits: standard ViT-L/14 inputs at 336px lose fine-grained details
  • Spatial reasoning: complex left/right/above/below relationships cause errors
  • Rare visual concepts: objects or scenes absent from CLIP’s training distribution

Successors and Variants

The LLaVA blueprint has been extended by:

  • LLaVA-1.5: MLP connector instead of linear, better data; substantially narrowed the gap to GPT-4V on academic benchmarks
  • LLaVA-NeXT (LLaVA-1.6): high-resolution support via image splitting
  • InstructBLIP: Q-Former architecture, Vicuna backbone
  • Qwen-VL / Qwen2.5-VL: cross-attention adapter in Qwen-VL, later replaced by a lightweight MLP merger with native dynamic resolution; among the strongest open-source VLMs
  • InternVL: strong vision encoder training combined with the LLaVA pipeline

Key Sources

  • contrastive-learning — CLIP, the standard vision encoder for VLMs, is trained via contrastive learning
  • sft — Stage 2 visual instruction tuning is supervised fine-tuning on multimodal data
  • in-context-learning — VLMs leverage the LLM’s in-context learning ability for visual reasoning
  • lora — LoRA is commonly used for efficient fine-tuning of the LLM component
  • transfer-learning — the entire VLM recipe relies on transfer from CLIP and the pretrained LLM

Open Questions

  • How to reduce hallucination while preserving helpfulness?
  • What is the optimal vision encoder for VLMs — should it be trained specifically for VLM use?
  • How to support very high resolution (4K+) without prohibitive token counts?
  • Can the two-stage pipeline be replaced with a single end-to-end training run?
  • What data mixture and scale optimally develops each VLM capability?