What It Is
Zero-shot transfer is the ability to perform a task without any task-specific training examples: the model receives only a natural language description of what to do, or generalizes implicitly from a pretraining distribution that never included the target task. The model transfers knowledge from pretraining to the new task without seeing a single labeled example. It is the strongest test of generalization: if the training distribution doesn’t include a task, can the model solve it anyway?
Why It Matters
Zero-shot transfer means models can generalize to tasks and categories that didn’t exist when they were trained. CLIP can classify images of concepts it never saw during training (“a photo of a platypus in a graduation cap”) because it learned a joint semantic embedding of images and language, not a fixed set of categories. GPT-3 can summarize, translate, and answer questions without ever being fine-tuned for any of those tasks. Zero-shot transfer is what makes general-purpose models possible — without it, every task would require its own training dataset.
How It Works
In Language Models
Pretraining on broad text corpora teaches the model to solve an implicit distribution of tasks: question answering (Wikipedia facts followed by questions), translation (multilingual text), summarization (articles followed by TL;DRs). Zero-shot transfer occurs when the model generalizes these learned patterns to novel instructions at inference time.
Mechanism: the prompt encodes the task structure, and the model maps this to the closest learned pattern. “Translate to French: {text}” matches patterns from multilingual training data. “Is the sentiment positive or negative?” matches patterns from product reviews and text classification examples seen in pretraining.
Scale is critical: zero-shot performance on structured tasks is near-random for small models and emergent above a scale threshold. GPT-3’s 175B parameters exhibit reliable zero-shot behavior; GPT-2’s 1.5B does not for many tasks.
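The prompt-as-task-specification idea can be sketched minimally. This is an illustration only: the `make_zero_shot_prompt` helper is hypothetical, and the string it produces would be passed to any text-completion model.

```python
# Zero-shot: the prompt alone specifies the task -- no examples, no fine-tuning.
# `make_zero_shot_prompt` is a hypothetical helper for illustration.

def make_zero_shot_prompt(instruction: str, text: str) -> str:
    """Encode a task as instruction + input; the model maps this string
    to the closest pattern it learned during pretraining."""
    return f"{instruction}\n\n{text}\n\nAnswer:"

prompt = make_zero_shot_prompt(
    "Is the sentiment of the following review positive or negative?",
    "The battery died after two days. Avoid.",
)
print(prompt)
```

No demonstrations appear anywhere in the prompt; the instruction alone must match a pattern (here, sentiment classification from review-like text) that pretraining implicitly covered.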
In CLIP (Visual Zero-Shot)
CLIP replaces learned image classifiers with text embeddings of class names:
```
Standard supervised ImageNet classifier:
  1,000 fixed classes → 1,000-way linear layer → trained on ImageNet
  Can classify: only the 1,000 ImageNet classes

CLIP zero-shot classifier:
  Encode class names as text:  ["a photo of a cat", "a photo of a dog", ...]
  Encode test image:           v_image
  Compute cosine similarity to each class embedding
  Predict:                     argmax(cosine_sim(v_image, v_text_i))
  Can classify: any concept expressible in language
```
No ImageNet training is required. CLIP achieves 76.2% top-1 accuracy on ImageNet zero-shot — matching the performance of a supervised ResNet-50 that was fully trained on 1.28M ImageNet images.
The class name prompt matters: “a photo of a {class}” outperforms just “{class}” by several percentage points because CLIP was trained on natural language captions, not bare labels. Prompt engineering is effective: “a photo of a {class}, a type of pet” beats “a photo of a {class}” for fine-grained categories.
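The zero-shot classification step above can be written in a few lines of NumPy. This sketch assumes the image and text embeddings have already been produced by CLIP’s encoders; here they are hand-picked toy vectors, and `zero_shot_classify` is an illustrative name, not CLIP’s API.

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray) -> int:
    """CLIP-style zero-shot prediction: cosine similarity between one
    image embedding and one text embedding per class, then argmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

class_names = ["cat", "dog", "platypus"]
# Prompt template matters: captions, not bare labels, match CLIP's training data.
prompts = [f"a photo of a {c}" for c in class_names]

# Toy stand-ins for the encoder outputs (illustration only).
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.0, 1.0, 0.1],
                      [0.1, 0.0, 1.0]])
image_emb = np.array([0.1, 0.05, 0.9])  # points near the "platypus" direction

print(class_names[zero_shot_classify(image_emb, text_embs)])  # → platypus
```

Adding a new class requires no retraining at all: append one more caption to `prompts`, encode it, and the classifier now covers it.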
The Zero-Shot Cascade
Zero-shot transfer cascades through embedding spaces: if a multimodal model learns that audio spectrograms and images of the same scene cluster together, it can perform audio-to-image retrieval zero-shot, even without training on audio-image pairs. ImageBind demonstrates this — zero-shot audio classification emerges from image-audio + image-text alignment, without ever training on audio-text pairs.
```
Trained:   image ↔ text alignment
           image ↔ audio alignment
Emergent:  text ↔ audio alignment (zero-shot)
           text → audio retrieval (zero-shot)
```
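A toy sketch of the cascade: if audio and text embeddings are each aligned to images during training, they land near each other in the shared space even though no text-audio pair was ever trained. All vectors below are hand-picked for illustration, not real encoder outputs.

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One shared embedding space (toy vectors).
# "dog bark" audio and "a dog barking" text were each aligned to dog images
# during training, never to each other -- yet they end up nearby.
img_dog    = np.array([1.0, 0.0, 0.0])
audio_bark = np.array([0.9, 0.1, 0.0])   # trained: image ↔ audio
text_dog   = np.array([0.95, 0.0, 0.1])  # trained: image ↔ text
text_piano = np.array([0.0, 1.0, 0.0])

# Emergent: text ↔ audio similarity, with zero text-audio training pairs.
print(cos(text_dog, audio_bark) > cos(text_piano, audio_bark))  # → True
```

The image modality acts as the pivot: alignment to a common anchor makes the two never-paired modalities comparable.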
What’s Clever
The key non-obvious property: the text prompt is a program that selects a behavior from the model’s learned repertoire. “Respond in the style of Shakespeare” doesn’t require fine-tuning on Shakespeare — it retrieves Shakespearean patterns from pretraining. This is only possible because language is a universal interface to the model’s knowledge.
Contrast with few-shot transfer, which provides demonstration examples. Zero-shot is strictly harder: the model must infer the task from the description alone. Interestingly, zero-shot sometimes outperforms few-shot, because demonstration examples can narrow the model’s interpretation of what’s wanted and introduce false constraints.
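The contrast is purely in prompt construction. A minimal sketch, with a hypothetical `build_prompt` helper (the translation demonstrations follow the format used in the GPT-3 paper):

```python
def build_prompt(task: str, query: str, examples=None) -> str:
    """Zero-shot when `examples` is empty; few-shot (in-context) otherwise.
    The only difference is whether demonstrations appear in the prompt."""
    parts = [task]
    for x, y in (examples or []):
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Translate English to French.", "cheese")
few_shot = build_prompt(
    "Translate English to French.",
    "cheese",
    examples=[("sea otter", "loutre de mer"), ("plush giraffe", "girafe peluche")],
)
print(zero_shot)
```

In both cases the weights are frozen; only the conditioning context changes.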
Common misconception: zero-shot means the model “knows” the task. It doesn’t — it means the model can generalize its pretraining patterns to the task description. If the task is genuinely outside the pretraining distribution (e.g., solving novel mathematical structures not in training data), zero-shot fails. The “zero-shot” capability is bounded by what the pretraining data implicitly covered.
Zero-Shot vs. Related Concepts
| Paradigm | Labeled Examples at Inference | How Task Is Specified |
|---|---|---|
| Zero-shot | 0 | Task description only |
| Few-shot (in-context) | 1-32 in the prompt | Description + examples |
| Fine-tuned | Thousands (at training) | Implicit in weights |
| Zero-shot transfer | 0 | Description only; pretraining must have implicitly covered it |
Key Sources
- clip-learning-transferable-visual-models — CLIP; 76.2% zero-shot ImageNet accuracy matching supervised ResNet-50
- language-models-are-few-shot-learners — GPT-3; zero-shot and few-shot scaling behavior across 42 tasks
Related Concepts
- in-context-learning — the few-shot counterpart of zero-shot; provides examples in the prompt instead of relying purely on pretraining
- contrastive-learning — enables multimodal zero-shot by aligning cross-modal embeddings
- emergent-abilities — many zero-shot capabilities are emergent: absent below a scale threshold, present above it
- multimodal-embeddings — the shared embedding spaces that enable zero-shot cross-modal transfer
- transfer-learning — supervised pretraining + fine-tuning; zero-shot transfer skips the fine-tuning step
Open Questions
- Why does zero-shot transfer work at all for tasks with no direct pretraining analog?
- Does scale always improve zero-shot, or do some capabilities require task-specific supervision regardless of model size?
- How do you reliably measure zero-shot capability without accidentally including the test distribution in pretraining data?
- Can prompt optimization close the gap between zero-shot and fine-tuned performance?