What It Is
Vision-language models (VLMs) are neural networks that take both images and text as input and produce text as output. They can describe images, answer questions about visual content, read text in images, and reason across modalities.
Why It Matters
VLMs underpin much of modern document understanding, OCR, medical image analysis, and visual question answering. As LLMs matured, a natural and high-impact next step was giving them eyes: letting them process the visual world rather than only text.
How It Works
Most VLMs use a vision encoder (often a ViT) to embed image patches into vectors, then project those vectors into the same space as text token embeddings. The combined sequence of image and text embeddings is fed into a transformer decoder that generates text autoregressively. The key research question is how to align the visual and language representation spaces during pre-training.
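The pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not any particular model's implementation: the weights are random stand-ins for learned parameters, the "vision encoder" is reduced to a single patch-embedding layer (a real ViT would also run transformer layers), and all names (`patchify`, `make_linear`, `build_decoder_input`) and sizes are hypothetical. It only shows how image patches and text tokens end up in one sequence of equal-width vectors.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def make_linear(rng, d_in, d_out):
    """Random d_in x d_out matrix standing in for a learned linear layer."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]


def apply_linear(vectors, weights):
    """Multiply each row vector by the weight matrix."""
    d_out = len(weights[0])
    return [[sum(x * weights[i][j] for i, x in enumerate(vec)) for j in range(d_out)]
            for vec in vectors]


def build_decoder_input(image, token_ids, patch=4, d_vision=8, d_model=16,
                        vocab=100, seed=0):
    """Return the combined image + text embedding sequence fed to the decoder."""
    rng = random.Random(seed)

    # 1. Vision encoder stand-in (assumption: one linear patch embedding only).
    patch_embed = make_linear(rng, patch * patch, d_vision)
    visual_features = apply_linear(patchify(image, patch), patch_embed)

    # 2. Projector: map visual features into the text embedding width.
    projector = make_linear(rng, d_vision, d_model)
    visual_tokens = apply_linear(visual_features, projector)

    # 3. Text embeddings: look up one row of the embedding table per token id.
    embedding_table = make_linear(rng, vocab, d_model)
    text_tokens = [embedding_table[t] for t in token_ids]

    # 4. Combined sequence: image tokens first, then text tokens.
    return visual_tokens + text_tokens


# An 8 x 8 image with 4 x 4 patches yields 4 image tokens.
image = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
sequence = build_decoder_input(image, token_ids=[5, 17, 42])
print(len(sequence), len(sequence[0]))  # prints "7 16": 4 image + 3 text tokens, width 16
```

In a real system the decoder would attend over this sequence and generate text one token at a time. Which of the three components are trained and which stay frozen is one of the main design choices that distinguishes VLM recipes.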
Key Sources
- blip-2-bootstrapping-language-image-pretraining — establishes the freeze-then-bridge pattern: train only a 188M Q-Former between frozen ViT and frozen LLM
- flamingo-visual-language-model-few-shot-learning — introduces interleaved image-text training and gated cross-attention for few-shot visual prompting; an influential precursor to BLIP-2 and LLaVA
Related Concepts
- attention — some VLMs (e.g. Flamingo) apply cross-attention from language tokens to visual features; others rely on self-attention over the combined image-text sequence
- transformer — the backbone architecture for most VLMs
- contrastive-learning — CLIP-style image-text contrastive training is a common way to pre-train the vision encoder of a VLM
- ensemble-methods — CE-Ensemble uses multiple VLMs to improve OCR quality