The Problem
Before GPT-3, the standard paradigm was: pretrain a large model, then fine-tune it on task-specific data. Sentiment analysis? Fine-tune. Translation? Fine-tune. Each task required labeled data and a training run.
Then GPT-3 showed something strange. If you put a few examples of sentiment analysis directly in the prompt — no gradient updates, no task-specific training — the model would do sentiment analysis. And it was nearly as good as fine-tuned models.
The model wasn’t learning in the conventional sense: its weights didn’t change. It was doing something else.
What ICL Actually Is
In-context learning is the model’s ability to use the structure of examples in its input to perform a task it wasn’t explicitly trained on — purely through pattern completion.
The model doesn’t update. It reads your examples, infers what you’re asking, and generates a completion that fits the pattern.
Zero-shot: give it a task description only (“Classify this as positive or negative:”). One-shot: one worked example before the test input. Few-shot: several worked examples before the test input (typically 4–32, bounded by the context window).
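The three settings differ only in how many solved demonstrations precede the test input. A minimal sketch in Python (the template, labels, and `->` separator are illustrative choices, not a fixed standard):

```python
# Building zero-, one-, and few-shot prompts for sentiment classification.
# Everything here (task instruction, labels, separator) is illustrative.

DEMOS = [
    ("Absolutely loved it!", "Positive"),
    ("Terrible product.", "Negative"),
    ("Not bad, I guess.", "Neutral"),
]

def build_prompt(test_input, n_shots=0):
    """Prefix the test input with n_shots solved demonstrations."""
    lines = ["Classify this review as Positive, Negative, or Neutral.", ""]
    for text, label in DEMOS[:n_shots]:
        lines.append(f'Review: "{text}" -> {label}')
    lines.append(f'Review: "{test_input}" ->')
    return "\n".join(lines)

print(build_prompt("Best purchase ever!", n_shots=0))  # zero-shot
print(build_prompt("Best purchase ever!", n_shots=3))  # few-shot
```

The model sees only this string; n_shots controls which regime you are in.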
GPT-3 (175B) showed this capability reliably; GPT-2 (1.5B) did not. It appears to be an emergent behavior of scale.
Mechanism in Plain English
Nobody fully knows why it works. The two best mechanistic theories:
Theory 1: Implicit Bayesian inference
The model, through pretraining, has learned a compressed model of how text is generated across many tasks and domains. When you show it examples of task X, you’re Bayesian-updating its “belief” about which task it’s currently in. The model essentially asks: “given these examples, what task distribution am I drawn from, and what should I output next?”
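A toy version of this view: hold a prior over candidate labeling rules, multiply in the likelihood of each demonstration, and read off a posterior over which task the prompt was drawn from. The three rules and the noise model below are invented purely for illustration:

```python
# Toy model of Theory 1: demonstrations Bayesian-update a posterior over
# which task generated the prompt. The candidate "tasks" here are made-up
# labeling rules, not anything a real LM represents explicitly.

TASKS = {
    "sentiment":  lambda t: "Positive" if ("loved" in t or "Best" in t) else "Negative",
    "length":     lambda t: "Positive" if len(t) > 20 else "Negative",
    "first_char": lambda t: "Positive" if t[0].isupper() else "Negative",
}

def posterior(demos, noise=0.1):
    """P(task | demos), assuming each demo's label matches the task's
    rule with probability 1 - noise."""
    post = {name: 1.0 for name in TASKS}          # uniform prior
    for text, label in demos:
        for name, rule in TASKS.items():
            post[name] *= (1 - noise) if rule(text) == label else noise
    total = sum(post.values())
    return {name: p / total for name, p in post.items()}

demos = [("Absolutely loved it!", "Positive"),
         ("Terrible product.", "Negative")]
print(posterior(demos))  # probability mass concentrates on "sentiment"
```

Two demonstrations are enough to make the sentiment rule dominate; adding more would sharpen the posterior further, which matches the intuition that more shots disambiguate the task.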
Theory 2: Gradient descent in the forward pass
Some theoretical work (Akyürek et al. 2022; Dai et al. 2022; von Oswald et al. 2022) shows that transformer attention can implement gradient descent steps implicitly. Under simplifying assumptions, linear self-attention layers can exactly reproduce gradient-descent (and ridge-regression) updates on the in-context examples. On this view, the demonstrations provide the “training signal” that the forward pass processes as a sort of implicit learning.
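A numerical sketch of the core identity (following the linear self-attention construction in von Oswald et al. 2022): with no softmax, an attention head whose keys/values are the demonstration inputs/labels computes exactly the prediction of one gradient-descent step of linear regression starting from w = 0. The data and dimensions below are arbitrary:

```python
import numpy as np

# One GD step vs. one linear-attention pass on the same in-context data.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 in-context demonstrations, d = 4
w_true = rng.normal(size=4)
y = X @ w_true                     # demonstration labels
x_q = rng.normal(size=4)           # query (test) input
lr = 0.1

# GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, from w = 0:
# the gradient at 0 is -sum_i y_i x_i, so w1 = lr * sum_i y_i x_i.
w1 = lr * (y @ X)
pred_gd = w1 @ x_q

# Linear attention: query = x_q, keys = x_i, values = y_i, no softmax:
# output = sum_i v_i (k_i . q), here scaled by lr.
attn_scores = X @ x_q
pred_attn = lr * (y @ attn_scores)

assert np.isclose(pred_gd, pred_attn)  # identical predictions
```

The two predictions agree exactly because (y @ X) @ x_q = y @ (X @ x_q); softmax attention and deeper stacks only approximate this, which is part of why the theory is partial.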
Both theories are partial. The full answer is open.
ASCII Diagram
Few-shot ICL prompt structure:
┌─────────────────────────────────────────────────┐
│ Review: "Absolutely loved it!"  → Positive      │ ← example 1
│ Review: "Terrible product."     → Negative      │ ← example 2
│ Review: "Not bad, I guess."     → Neutral       │ ← example 3
│                                                 │
│ Review: "Best purchase ever!"   → ???           │ ← test input
└─────────────────────────────────────────────────┘
↓
model completes: "Positive"
No gradient updates. No fine-tuning.
The format alone told the model what to do.
────────────────────────────────────────────────────
What changes with model size (GSM8K math, few-shot):
Model size:    1B    7B   13B   65B   175B   540B
Accuracy:      2%   11%   17%   35%    46%    58%
                                      ↑ emergent above ~100B
Concrete Walkthrough
Task: entity extraction. Extract the company name from a sentence.
Zero-shot:
Input: "Apple released a new iPhone yesterday."
Task: Extract the company name.
Output: [model guesses "Apple" — may or may not work]
Few-shot (3 examples):
"Microsoft acquired Activision." → Company: Microsoft
"Google launched Bard in 2023." → Company: Google
"Meta rebranded from Facebook." → Company: Meta
"Apple released a new iPhone yesterday." → Company:
Model outputs: “Apple”
The examples didn’t teach the model what companies are — it already knew. They told the model the format and what aspect to extract. The demonstrations constrain the output space.
This matters: ICL is about format specification and task disambiguation, not information transfer. The model already knows everything it needs. The examples just tell it which task you’re doing.
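The walkthrough above amounts to assembling one prompt string; `complete` below is a placeholder for whatever model call you use (hosted API, local runtime), not a real function:

```python
# Few-shot entity extraction as prompt assembly. The demonstrations and
# the "-> Company:" format are taken from the walkthrough above.

EXTRACTION_DEMOS = [
    ("Microsoft acquired Activision.", "Microsoft"),
    ("Google launched Bard in 2023.", "Google"),
    ("Meta rebranded from Facebook.", "Meta"),
]

def extraction_prompt(sentence):
    """Three solved demonstrations, then the test sentence with the
    answer slot left open for the model to complete."""
    lines = [f'"{text}" -> Company: {name}' for text, name in EXTRACTION_DEMOS]
    lines.append(f'"{sentence}" -> Company:')
    return "\n".join(lines)

prompt = extraction_prompt("Apple released a new iPhone yesterday.")
# answer = complete(prompt, stop="\n")   # placeholder model call
print(prompt)
```

Ending the prompt at `Company:` is the point: the open slot plus three consistent demonstrations constrains the completion to a single company name.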
What’s Clever
The surprising thing isn’t that ICL works — it’s that the labels barely matter.
Several studies (notably Min et al. 2022) showed that randomizing the labels in few-shot examples — replacing each gold label with a uniformly random one from the label set — barely hurts performance. The model is using the structure of the demonstrations (input-output format, which tokens appear, what style of completion is expected) far more than the semantic correctness of the labels.
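The experimental setup is simple to sketch: keep the demonstration inputs and format fixed, but replace the gold labels with uniformly random ones, then compare accuracy. `evaluate` below is a placeholder for an actual model-accuracy loop, not a real function:

```python
import random

# Label-randomization setup in the spirit of Min et al. 2022: inputs and
# format are preserved; only the labels are resampled at random.

LABELS = ["Positive", "Negative", "Neutral"]

def randomize_labels(demos, seed=0):
    """Replace each gold label with a uniformly random label."""
    rng = random.Random(seed)
    return [(text, rng.choice(LABELS)) for text, _ in demos]

gold = [("Absolutely loved it!", "Positive"),
        ("Terrible product.", "Negative"),
        ("Not bad, I guess.", "Neutral")]
shuffled = randomize_labels(gold)

# accuracy_gold = evaluate(model, demos=gold)      # placeholder
# accuracy_rand = evaluate(model, demos=shuffled)  # placeholder
# The reported finding: accuracy_rand sits only slightly below
# accuracy_gold, because the format carries most of the signal.
print(shuffled)
```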
This suggests ICL is less about “learning from examples” and more about “activating the right task head.” The pretraining corpus already contains implicit representations of countless tasks. ICL prompts steer the model toward the right representation.
The second insight: input distribution, label space, and output format matter more than label correctness. A consistent input-output format, examples drawn from the right input distribution, and the right set of candidate labels are what actually drive ICL performance — not example count or label accuracy per se.
Open Questions
- Mechanistic explanation of how ICL occurs in transformer forward passes
- Why longer context and more demonstrations don’t always help