The Problem
Before GPT-3, the standard paradigm was: pretrain a large model, then fine-tune it on task-specific data. Sentiment analysis? Fine-tune. Translation? Fine-tune. Each task required labeled data and a training run.
Then GPT-3 showed something strange. If you put a few examples of sentiment analysis directly in the prompt — no gradient updates, no task-specific training — the model would do sentiment analysis. And it was nearly as good as fine-tuned models.
The model wasn’t learning in the conventional sense: its weights didn’t change. It was doing something else.
What ICL Actually Is
In-context learning is the model’s ability to use the structure of examples in its input to perform a task it wasn’t explicitly trained on — purely through pattern completion.
The model doesn’t update. It reads your examples, infers what you’re asking, and generates a completion that fits the pattern.
Zero-shot: give it a task description only (“Classify this as positive or negative:”). One-shot: one worked example before the test input. Few-shot: several worked examples before the test input (typically 4–32, bounded by the context window).
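The three settings differ only in how many solved demonstrations precede the test input. A minimal sketch in Python (the template, labels, and `->` separator are illustrative choices, not a fixed standard):

```python
# Building zero-, one-, and few-shot prompts for sentiment classification.
# Everything here (task instruction, labels, separator) is illustrative.

DEMOS = [
    ("Absolutely loved it!", "Positive"),
    ("Terrible product.", "Negative"),
    ("Not bad, I guess.", "Neutral"),
]

def build_prompt(test_input, n_shots=0):
    """Prefix the test input with n_shots solved demonstrations."""
    lines = ["Classify this review as Positive, Negative, or Neutral.", ""]
    for text, label in DEMOS[:n_shots]:
        lines.append(f'Review: "{text}" -> {label}')
    lines.append(f'Review: "{test_input}" ->')
    return "\n".join(lines)

print(build_prompt("Best purchase ever!", n_shots=0))  # zero-shot
print(build_prompt("Best purchase ever!", n_shots=3))  # few-shot
```

The model sees only this string; n_shots controls which regime you are in.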
GPT-3 (175B) showed this capability reliably; GPT-2 (1.5B) did not. It appears to be an emergent behavior of scale.
Mechanism in Plain English
Nobody fully knows why it works. The two best mechanistic theories:
Theory 1: Implicit Bayesian inference
The model, through pretraining, has learned a compressed model of how text is generated across many tasks and domains. When you show it examples of task X, you’re Bayesian-updating its “belief” about which task it’s currently in. The model essentially asks: “given these examples, what task distribution am I drawn from, and what should I output next?”
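A toy version of this view: hold a prior over candidate labeling rules, multiply in the likelihood of each demonstration, and read off a posterior over which task the prompt was drawn from. The three rules and the noise model below are invented purely for illustration:

```python
# Toy model of Theory 1: demonstrations Bayesian-update a posterior over
# which task generated the prompt. The candidate "tasks" here are made-up
# labeling rules, not anything a real LM represents explicitly.

TASKS = {
    "sentiment":  lambda t: "Positive" if ("loved" in t or "Best" in t) else "Negative",
    "length":     lambda t: "Positive" if len(t) > 20 else "Negative",
    "first_char": lambda t: "Positive" if t[0].isupper() else "Negative",
}

def posterior(demos, noise=0.1):
    """P(task | demos), assuming each demo's label matches the task's
    rule with probability 1 - noise."""
    post = {name: 1.0 for name in TASKS}          # uniform prior
    for text, label in demos:
        for name, rule in TASKS.items():
            post[name] *= (1 - noise) if rule(text) == label else noise
    total = sum(post.values())
    return {name: p / total for name, p in post.items()}

demos = [("Absolutely loved it!", "Positive"),
         ("Terrible product.", "Negative")]
print(posterior(demos))  # probability mass concentrates on "sentiment"
```

Two demonstrations are enough to make the sentiment rule dominate; adding more would sharpen the posterior further, which matches the intuition that more shots disambiguate the task.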
Theory 2: Gradient descent in the forward pass
Some theoretical work (Akyürek et al. 2022; Dai et al. 2022; von Oswald et al. 2022) shows that transformer attention can implement gradient descent steps implicitly. Under simplifying assumptions, linear self-attention layers can exactly reproduce gradient-descent (and ridge-regression) updates on the in-context examples. On this view, the demonstrations provide the “training signal” that the forward pass processes as a sort of implicit learning.
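A numerical sketch of the core identity (following the linear self-attention construction in von Oswald et al. 2022): with no softmax, an attention head whose keys/values are the demonstration inputs/labels computes exactly the prediction of one gradient-descent step of linear regression starting from w = 0. The data and dimensions below are arbitrary:

```python
import numpy as np

# One GD step vs. one linear-attention pass on the same in-context data.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 in-context demonstrations, d = 4
w_true = rng.normal(size=4)
y = X @ w_true                     # demonstration labels
x_q = rng.normal(size=4)           # query (test) input
lr = 0.1

# GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, from w = 0:
# the gradient at 0 is -sum_i y_i x_i, so w1 = lr * sum_i y_i x_i.
w1 = lr * (y @ X)
pred_gd = w1 @ x_q

# Linear attention: query = x_q, keys = x_i, values = y_i, no softmax:
# output = sum_i v_i (k_i . q), here scaled by lr.
attn_scores = X @ x_q
pred_attn = lr * (y @ attn_scores)

assert np.isclose(pred_gd, pred_attn)  # identical predictions
```

The two predictions agree exactly because (y @ X) @ x_q = y @ (X @ x_q); softmax attention and deeper stacks only approximate this, which is part of why the theory is partial.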
Both theories are partial. The full answer is open.
ASCII Diagram
Few-shot ICL prompt structure:
┌─────────────────────────────────────────────────┐
│ Review: "Absolutely loved it!"  → Positive      │ ← example 1
│ Review: "Terrible product."     → Negative      │ ← example 2
│ Review: "Not bad, I guess."     → Neutral       │ ← example 3
│                                                 │
│ Review: "Best purchase ever!"   → ???           │ ← test input
└─────────────────────────────────────────────────┘
↓
model completes: "Positive"
No gradient updates. No fine-tuning.
The format alone told the model what to do.
────────────────────────────────────────────────────
What changes with model size (GSM8K math, few-shot):
Model size:    1B    7B   13B   65B   175B   540B
Accuracy:      2%   11%   17%   35%    46%    58%
                                      ↑ emergent above ~100B
Concrete Walkthrough
Task: entity extraction. Extract the company name from a sentence.
Zero-shot:
Input: "Apple released a new iPhone yesterday."
Task: Extract the company name.
Output: [model guesses "Apple" — may or may not work]
Few-shot (3 examples):
"Microsoft acquired Activision." → Company: Microsoft
"Google launched Bard in 2023." → Company: Google
"Meta rebranded from Facebook." → Company: Meta
"Apple released a new iPhone yesterday." → Company:
Model outputs: “Apple”
The examples didn’t teach the model what companies are — it already knew. They told the model the format and what aspect to extract. The demonstrations constrain the output space.
This matters: ICL is about format specification and task disambiguation, not information transfer. The model already knows everything it needs. The examples just tell it which task you’re doing.
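The walkthrough above amounts to assembling one prompt string; `complete` below is a placeholder for whatever model call you use (hosted API, local runtime), not a real function:

```python
# Few-shot entity extraction as prompt assembly. The demonstrations and
# the "-> Company:" format are taken from the walkthrough above.

EXTRACTION_DEMOS = [
    ("Microsoft acquired Activision.", "Microsoft"),
    ("Google launched Bard in 2023.", "Google"),
    ("Meta rebranded from Facebook.", "Meta"),
]

def extraction_prompt(sentence):
    """Three solved demonstrations, then the test sentence with the
    answer slot left open for the model to complete."""
    lines = [f'"{text}" -> Company: {name}' for text, name in EXTRACTION_DEMOS]
    lines.append(f'"{sentence}" -> Company:')
    return "\n".join(lines)

prompt = extraction_prompt("Apple released a new iPhone yesterday.")
# answer = complete(prompt, stop="\n")   # placeholder model call
print(prompt)
```

Ending the prompt at `Company:` is the point: the open slot plus three consistent demonstrations constrains the completion to a single company name.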
What’s Clever
The surprising thing isn’t that ICL works — it’s that the labels barely matter.
Several studies (notably Min et al. 2022) showed that randomizing the labels in few-shot examples — replacing each gold label with a uniformly random one from the label set — barely hurts performance. The model is using the structure of the demonstrations (input-output format, which tokens appear, what style of completion is expected) far more than the semantic correctness of the labels.
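The experimental setup is simple to sketch: keep the demonstration inputs and format fixed, but replace the gold labels with uniformly random ones, then compare accuracy. `evaluate` below is a placeholder for an actual model-accuracy loop, not a real function:

```python
import random

# Label-randomization setup in the spirit of Min et al. 2022: inputs and
# format are preserved; only the labels are resampled at random.

LABELS = ["Positive", "Negative", "Neutral"]

def randomize_labels(demos, seed=0):
    """Replace each gold label with a uniformly random label."""
    rng = random.Random(seed)
    return [(text, rng.choice(LABELS)) for text, _ in demos]

gold = [("Absolutely loved it!", "Positive"),
        ("Terrible product.", "Negative"),
        ("Not bad, I guess.", "Neutral")]
shuffled = randomize_labels(gold)

# accuracy_gold = evaluate(model, demos=gold)      # placeholder
# accuracy_rand = evaluate(model, demos=shuffled)  # placeholder
# The reported finding: accuracy_rand sits only slightly below
# accuracy_gold, because the format carries most of the signal.
print(shuffled)
```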
This suggests ICL is less about “learning from examples” and more about “activating the right task head.” The pretraining corpus already contains implicit representations of countless tasks. ICL prompts steer the model toward the right representation.
The second insight: input distribution, label space, and output format matter more than label correctness. A consistent input-output format, examples drawn from the right input distribution, and the right set of candidate labels are what actually drive ICL performance — not example count or label accuracy per se.
Open Questions
- Mechanistic explanation of how ICL occurs in transformer forward passes
- Why longer context and more demonstrations don’t always help