Vision-Language Models (VLMs)

What It Is

Vision-language models (VLMs) are neural networks that take both images and text as input and produce text as output. They can describe images, answer questions about visual content, read text in images, and reason across modalities.

Why It Matters

VLMs underpin much of modern document understanding, OCR, medical image analysis, and visual question answering. As LLMs matured, a natural next step was to give them visual input, so that they can reason about images as well as text.

How It Works

Most VLMs use a vision encoder (often a ViT) to embed image patches into vectors, then project those vectors into the same space as text token embeddings. The combined sequence of image and text embeddings is fed into a transformer decoder that generates text autoregressively. Some architectures, such as Flamingo, instead inject visual features into the language model through gated cross-attention layers. A central research question is how to align the visual and language representation spaces during pre-training.
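The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the vision encoder is reduced to a single linear patch embedding, the weights are random, and every name and size here (`patchify`, `linear`, `PATCH`, `D_VISION`, `D_MODEL`, the four-word vocabulary) is a hypothetical choice made for the example.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def linear(vectors, weight):
    """Multiply each input vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [[sum(v[k] * weight[k][j] for k in range(len(v))) for j in range(d_out)]
            for v in vectors]

def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
PATCH, D_VISION, D_MODEL = 4, 8, 16  # hypothetical sizes

# Stand-in for the vision encoder: one linear patch embedding (a real ViT adds
# position embeddings and many transformer layers on top of this).
w_patch = random_matrix(PATCH * PATCH, D_VISION, rng)
# Projector that maps vision features into the text embedding space.
w_proj = random_matrix(D_VISION, D_MODEL, rng)
# Toy text embedding table for a four-token vocabulary.
vocab = {"what": 0, "is": 1, "shown": 2, "?": 3}
embed_table = random_matrix(len(vocab), D_MODEL, rng)

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8 x 8 toy image
prompt = ["what", "is", "shown", "?"]

vision_features = linear(patchify(image, PATCH), w_patch)  # one vector per patch
image_tokens = linear(vision_features, w_proj)             # now in text space
text_tokens = [embed_table[vocab[t]] for t in prompt]

# The decoder would consume this combined sequence and generate text from it.
decoder_input = image_tokens + text_tokens
print(len(image_tokens), len(text_tokens), len(decoder_input))
```

The point of the sketch is only the data flow: after projection, image patches and text tokens are vectors of the same width, so a decoder can treat them as one sequence.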

Key Sources

consensus-entropy-multi-vlm-agreement-ocr
an-image-is-worth-16x16-words
clip-learning-transferable-visual-models
dino-self-supervised-vision-transformers
falcon-perception-vlm
llava-visual-instruction-tuning
metis-hdpo-meta-cognitive-tool-use
numina-counting-text-to-video
scaling-laws-neural-language-models
segment-anything
simclr-contrastive-learning-visual-representations
blip-2-bootstrapping-language-image-pretraining: establishes the freeze-then-bridge pattern, training only a 188M-parameter Q-Former between a frozen ViT and a frozen LLM
flamingo-visual-language-model-few-shot-learning: introduces interleaved image-text training and gated cross-attention for few-shot visual prompting; a precursor to BLIP-2 and LLaVA
llava-1-5-improved-baselines-with-visual-instruction-tuning
qwen2-5-vl-technical-report
sam-2-segment-anything-in-images-and-videos
word2vec-efficient-estimation-word-representations

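Related Concepts

attention: VLMs attend jointly over visual and language tokens, either through self-attention on the combined sequence or through cross-attention layers
transformer: the backbone architecture for most VLMs
contrastive-learning: CLIP-style contrastive training is a common strategy for aligning image and text representations in VLM pre-training
ensemble-methods: CE-Ensemble uses multiple VLMs to improve OCR quality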
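As a rough illustration of the contrastive-learning entry above, the following sketch computes a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It assumes a fixed temperature of 0.07 and hand-written two-dimensional embeddings; the function names are hypothetical and the code is not taken from any library.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image k should match text k, and text k should match image k."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                        for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Correctly paired embeddings should score a lower loss than mismatched ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched pairs apart, which is one way to align the two representation spaces.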
