Concepts: vision-language-models | multimodal-instruction-tuning | multimodal-embeddings | patch-embeddings
Builds on: llava-visual-instruction-tuning (original LLaVA recipe: linear projector, ViT-L/14, GPT-4-generated instruction data)
Leads to: qwen2-5-vl-technical-report, blip-2-bootstrapping-language-image-pretraining (modern VLMs that take the LLaVA-1.5 recipe as the open baseline)
LLaVA-1 made open-source visual instruction tuning real. LLaVA-1.5 made it a credible baseline. The team kept the architecture almost identical and asked a different question: what is the smallest set of changes that pushes the system to state of the art across a broad set of public VLM benchmarks (11 in the paper)? The answer turned out to be three changes, each individually unsurprising, that compound: a 2-layer MLP connector instead of a single linear layer, a higher-resolution CLIP-ViT-L at 336px, and a mix of academic VQA datasets with explicit response-format prompts. The whole 13B model trains in roughly one day on 8 A100s using 1.2M public examples.
The core idea
The LLaVA-1 architecture was already mostly right. A frozen CLIP vision encoder produces image tokens; a small projection module casts them into the LLM’s embedding space; the LLM treats them as a normal prefix. The question for 1.5 was: where does this stack actually leak performance, and which fixes are cheap?
Three answers emerged:
- The connector was too thin. A single linear projection forces the vision-language alignment to happen entirely inside that one matrix. Replacing it with a 2-layer MLP with GELU adds expressive capacity at low cost (tens of millions of parameters, well under 1% of a 7B or 13B LLM) and lets the projector compose features nonlinearly rather than apply one linear map.
- The input resolution was too low. Switching from CLIP ViT-L/14 at 224px to ViT-L/14 at 336px gives the model more visual tokens (576 vs 256) and more spatial detail. For tasks like OCR, chart reading, and document understanding, this matters more than any architectural cleverness.
- The training data was missing the obvious benchmarks. LLaVA-1 trained only on the 158K GPT-4-generated dialogues. LLaVA-1.5 mixes in standard academic VQA datasets (VQAv2, GQA, OK-VQA, OCR-VQA, A-OKVQA, TextCaps), but with a twist: each short-answer dataset comes with a response-format prompt like “Answer the question using a single word or phrase” so the model knows when to be terse and when to be conversational. Without this, mixing short-answer datasets degrades the model’s chat behavior.
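To make the connector change concrete, here is a minimal pure-Python sketch of a Linear, GELU, Linear projector together with a parameter count. The widths are assumptions for illustration: 1024 for CLIP ViT-L features, 4096 and 5120 for the 7B and 13B LLM embedding sizes, and a hidden layer as wide as the LLM embedding. The function names are hypothetical and not taken from the LLaVA codebase.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """y = W x + b for one token; weight is a list of d_out rows of length d_in."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_connector(x, w1, b1, w2, b2):
    """Linear -> GELU -> Linear, applied to a single visual token."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def connector_params(d_vision: int, d_llm: int) -> dict:
    """Compare a single linear projector with the 2-layer MLP (GELU has no parameters)."""
    single = linear_params(d_vision, d_llm)
    mlp = single + linear_params(d_llm, d_llm)  # assumed hidden width = d_llm
    return {"linear": single, "mlp": mlp}


# Assumed widths: 1024 (CLIP ViT-L), 4096 (7B LLM), 5120 (13B LLM).
for label, d_llm in (("7B", 4096), ("13B", 5120)):
    print(label, connector_params(1024, d_llm))
```

Under these assumptions the MLP is several times larger than the linear projector, yet still a small fraction of a percent of the LLM it feeds, which is why the extra capacity is cheap.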
Walkthrough
The full data recipe (1.2M examples total):
Stage 1 — pretraining (alignment):
558K image-caption pairs from LAION/CC/SBU, filtered.
Train ONLY the MLP connector. Vision encoder + LLM frozen.
~6 hours on 8 A100s for 13B.
Stage 2 — visual instruction tuning:
158K LLaVA conversations (from LLaVA-1)
+ 40K ShareGPT (text-only conversations)
+ 83K VQAv2 (open-ended VQA)
+ 72K GQA (compositional reasoning)
+ 9K OK-VQA (knowledge VQA)
+ 80K OCR-VQA (text in images)
+ 66K A-OKVQA (knowledge + reasoning)
+ 22K TextCaps (caption text-rich images)
+ 48K RefCOCO (visual grounding)
+ 86K Visual Genome (region descriptions)
= ~665K instruction examples
Train MLP connector + LLM. Vision encoder still frozen.
~20 hours on 8 A100s for 13B; about 10 hours for 7B.
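The freezing schedule in the two stages can be written down as a small table of which modules receive gradients. This is a sketch only: the module names (`vision_encoder`, `connector`, `llm`) and the `Stage` type are hypothetical labels for illustration, not identifiers from the LLaVA training code.

```python
from dataclasses import dataclass

# Hypothetical module names for the three parts of the stack.
MODULES = ("vision_encoder", "connector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset  # modules that are updated in this stage


STAGES = (
    Stage("pretraining_alignment", "558K image-caption pairs", frozenset({"connector"})),
    Stage("visual_instruction_tuning", "instruction mix", frozenset({"connector", "llm"})),
)


def requires_grad_plan(stage: Stage) -> dict:
    """Map each module to True (trained) or False (frozen) for the given stage."""
    return {module: module in stage.trainable for module in MODULES}


for stage in STAGES:
    print(stage.name, requires_grad_plan(stage))
```

In a real training script the same plan would typically be applied by toggling gradient tracking on each module's parameters before building the optimizer; the vision encoder stays frozen in both stages.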
The response-format prompt trick:
For VQAv2 (short-answer): "Answer the question using a single word or phrase."
For LLaVA conversations: no special prompt — model defaults to chat.
For OCR-VQA: "Answer the question using a single word or phrase."
For grounding: "<image>\n...<region>" with explicit coordinate tokens.
This single line, appended to the question in every short-answer training example, lets the same model toggle between “be a chatbot” and “be a benchmark answerer” based on a prompt cue at inference time. Without it, the model either refuses to be terse on benchmarks (and loses 5-10 points absolute) or becomes uselessly terse during free chat.
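A minimal sketch of how such a format cue could be attached when building training or inference prompts. The task labels, the `build_user_turn` helper, and the exact `<image>` placeholder layout are assumptions for illustration; the sketch places the instruction after the question, which is the placement the LLaVA-1.5 paper describes.

```python
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

# Hypothetical task labels; only the short-answer sets get the format cue.
SHORT_ANSWER_TASKS = {"vqav2", "ocr_vqa"}


def build_user_turn(question: str, task: str) -> str:
    """Return the user turn for one example, adding the format cue when needed."""
    question = question.strip()
    if task in SHORT_ANSWER_TASKS:
        # Short-answer datasets: question first, then the explicit format instruction.
        return f"<image>\n{question}\n{SHORT_ANSWER_PROMPT}"
    # Conversational data: no cue, so the model defaults to chat-style answers.
    return f"<image>\n{question}"


print(build_user_turn("What color is the bus?", "vqav2"))
print(build_user_turn("Describe what is happening in this photo.", "llava_conversation"))
```

The same helper can be reused at evaluation time, so that benchmark questions carry exactly the cue the model saw during training.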
What’s clever — find the instinct
The clever move is recognizing that the LLaVA-1 results were artificially low for an avoidable reason: the model had never been told to give one-word answers. The benchmark numbers were measuring two things at once (the model’s visual understanding and its calibration of response length to task), and the latter was dragging the former down.
“We use a different response formatting prompt for short-form VQA dataset to avoid the model overfitting to short-form answers and serve as a strong baseline.”
The architecture changes (MLP, 336px) are real but small effects. The data fix is the headline. State of the art on 11 benchmarks didn’t require a new training paradigm; it required prompting the training data correctly.
The other clever recognition: this whole stack is now small enough that a researcher with one node can reproduce it. The original LLaVA was a research artifact. LLaVA-1.5 is a baseline anyone can run, fork, and build on.
“Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node.”
Does it work? What breaks?
Headline numbers (LLaVA-1.5-13B vs prior open models):
| Benchmark | LLaVA-1 | InstructBLIP-13B | Qwen-VL-Chat | LLaVA-1.5-13B |
|---|---|---|---|---|
| VQAv2 | — | — | 78.2 | 80.0 |
| GQA | — | 49.5 | 57.5 | 63.3 |
| TextVQA | — | 50.7 | 61.5 | 61.3 |
| MMBench | 36.2 | — | 60.6 | 67.7 |
| MM-Vet | 26.7 | 25.6 | — | 35.4 |
| POPE (hallucination) | — | 78.9 | — | 85.9 |
The model is not just incrementally better — it is the first openly reproducible VLM that beats Qwen-VL-Chat (which had access to a much larger proprietary instruction set) on most benchmarks.
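The "most benchmarks" claim can be checked directly against the table. The sketch below copies the scores from the table above (blank cells become `None`) and compares LLaVA-1.5-13B with Qwen-VL-Chat on the benchmarks where both report a number; the `head_to_head` helper is a hypothetical name.

```python
# Scores copied from the table above; None marks a cell the table leaves blank.
SCORES = {
    "VQAv2":   {"LLaVA-1": None, "InstructBLIP-13B": None, "Qwen-VL-Chat": 78.2, "LLaVA-1.5-13B": 80.0},
    "GQA":     {"LLaVA-1": None, "InstructBLIP-13B": 49.5, "Qwen-VL-Chat": 57.5, "LLaVA-1.5-13B": 63.3},
    "TextVQA": {"LLaVA-1": None, "InstructBLIP-13B": 50.7, "Qwen-VL-Chat": 61.5, "LLaVA-1.5-13B": 61.3},
    "MMBench": {"LLaVA-1": 36.2, "InstructBLIP-13B": None, "Qwen-VL-Chat": 60.6, "LLaVA-1.5-13B": 67.7},
    "MM-Vet":  {"LLaVA-1": 26.7, "InstructBLIP-13B": 25.6, "Qwen-VL-Chat": None, "LLaVA-1.5-13B": 35.4},
    "POPE":    {"LLaVA-1": None, "InstructBLIP-13B": 78.9, "Qwen-VL-Chat": None, "LLaVA-1.5-13B": 85.9},
}


def head_to_head(scores: dict, model: str, baseline: str) -> dict:
    """Score difference (model minus baseline) on benchmarks both models report."""
    deltas = {}
    for benchmark, row in scores.items():
        if row[model] is not None and row[baseline] is not None:
            deltas[benchmark] = round(row[model] - row[baseline], 1)
    return deltas


deltas = head_to_head(SCORES, "LLaVA-1.5-13B", "Qwen-VL-Chat")
wins = sum(1 for d in deltas.values() if d > 0)
print(deltas)
print(f"LLaVA-1.5-13B ahead on {wins} of {len(deltas)} shared benchmarks")
```

On the rows shown here the two models share four benchmarks, and the only one where Qwen-VL-Chat stays ahead is TextVQA, by 0.2 points.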
What breaks:
- Image resolution caps at 336px. Tasks needing very high-resolution detail (small text, fine charts, detailed documents) are still bottlenecked. LLaVA-NeXT (1.6) and Qwen2.5-VL later address this with dynamic-resolution processing.
- Visual token count is fixed (576). Long-form image content (multi-page documents, video frames) doesn’t fit cleanly.
- No video, no audio. This is a single-image VLM. Video and audio require different glue.
- Hallucination is reduced but not solved. POPE 85.9 is good — but the model still confabulates objects when the visual evidence is ambiguous.
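The fixed 576-token budget follows from simple patch arithmetic. A short sketch, assuming a square input, non-overlapping 14-pixel patches, and counting patch tokens only (no class token); `num_patch_tokens` is a hypothetical helper name.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side


for size in (224, 336, 672):
    print(size, num_patch_tokens(size))
```

Because the count grows with the square of the resolution, doubling the input side from 336px to 672px would quadruple the visual tokens the LLM has to attend over, which is one reason simply raising the resolution of a fixed-grid encoder gets expensive quickly.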
So what?
LLaVA-1.5 is the recipe everyone borrowed. Its specific choices (frozen CLIP encoder, MLP connector, two-stage training that first aligns the connector and then instruction-tunes the connector and LLM together, academic VQA mixing with response-format prompts) became the default architecture for open-source VLMs from 2023 to 2025. Earlier and contemporary systems such as BLIP-2 (Q-Former) and Qwen-VL (cross-attention resampler) used heavier connectors; much of the later open-source work moved to minor variants of the simpler LLaVA-1.5 stack. The from-scratch ViT in Qwen2.5-VL is itself a critique of LLaVA’s “frozen CLIP” choice, but the rest of the recipe is preserved.
For an engineer running a VLM pipeline today (e.g., Saikat’s POI-extraction-from-street-imagery system using Qwen3.5-flash), the LLaVA-1.5 paper is still the reference for why certain prompting formats work. When fine-tuning a VLM to extract structured outputs from images, the response-format-prompt trick is exactly what’s needed to keep the model from getting chatty in the wrong places. The trick generalizes: when mixing datasets with different output expectations, prompt the format explicitly during training, not just at inference.
“We hope this can make state-of-the-art LMM research more accessible.”
This worked. By 2024 the LLaVA-1.5 codebase and weights had become the default starting point for vision-language research outside the major labs.
Connections
- llava-visual-instruction-tuning — the original LLaVA paper this directly builds on
- clip-learning-transferable-visual-models — CLIP-ViT-L is the frozen vision encoder
- blip-2-bootstrapping-language-image-pretraining — alternative VLM lineage with a Q-Former connector
- qwen2-5-vl-technical-report — modern VLM that replaces frozen CLIP with a from-scratch dynamic-resolution ViT
- vision-language-models — LLaVA-1.5 is the open baseline
- multimodal-instruction-tuning — the response-format-prompt trick is a key contribution to this concept
- multimodal-embeddings — the MLP connector projects CLIP embeddings into LLM space
- patch-embeddings — 336px CLIP yields 576 patch embeddings (24x24 grid)
Citation
Liu, H., Li, C., Li, Y., & Lee, Y. J. (2023). Improved Baselines with Visual Instruction Tuning. CVPR 2024 (highlight). https://arxiv.org/abs/2310.03744