Concepts: mechanistic-interpretability | probing | emergent-behavior | transformer
Builds on: attention-is-all-you-need | emergent-abilities-of-large-language-models
Leads to: Nanda et al. 2023 “Emergent Linear Representations” (follow-up, no explainer yet)
The problem
Language models predict the next token. That’s the whole training objective. But when a model trained this way also turns out to be good at chess puzzles, code completion, or logical reasoning, you have to ask: what is actually going on inside?
Two camps: camp one says these models are pattern-matchers, memorizing surface statistics — they’ve seen enough chess notation that they’ve learned “after this sequence of moves, this type of move tends to appear.” Camp two says they build genuine internal world models — they develop something like a mental representation of the game state that they consult when making predictions.
The stakes are real. If it’s memorization, then performance on novel distributions will collapse. If it’s a world model, the model may be doing something much more like reasoning.
The core idea
The analogy. Imagine a professional commentator who has spent an entire career reading written transcripts of games: move notation only, no boards. They have never seen a game played, but they have read ten million transcripts. You slide them a partial transcript and ask: what’s the next legal move? If they answer reliably, even for games they have never read, the simplest explanation is that they have built a mental model of the board, because legality depends on where every disc currently sits. You can test this further: tell them “assume square E6 were black instead of white. Now what’s legal?” If their answer changes correctly, you have evidence that they are computing from an internal board representation, not just pattern-matching the input string.
That’s exactly what this paper does, except the commentator is a GPT model and the game is Othello.
The mechanism, step by step.
Othello is an 8×8 board game. Players alternate placing discs; a move is legal only if it sandwiches at least one opponent disc. The paper trains an 8-layer GPT model — Othello-GPT — on sequences of moves, where each token is a tile index (60 possible tiles, excluding the 4 center tiles). No board state is ever given. No rules are ever specified. Just: here’s a sequence of moves, predict the next one.
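To make the legality rule concrete, here is a minimal sketch of a legal-move checker. It is not code from the paper: the board encoding, the `(row, col)` indexing, the starting arrangement of the four center discs, and the function names are all assumptions chosen for illustration.

```python
EMPTY, BLACK, WHITE = ".", "B", "W"
# The eight directions in which a line of discs can be sandwiched.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    """8x8 board with the four center discs placed (assumed standard setup)."""
    board = [[EMPTY] * 8 for _ in range(8)]
    board[3][3], board[4][4] = WHITE, WHITE
    board[3][4], board[4][3] = BLACK, BLACK
    return board

def flips(board, row, col, player):
    """Opponent discs that would be flipped if `player` placed a disc at (row, col)."""
    if board[row][col] != EMPTY:
        return []
    opponent = WHITE if player == BLACK else BLACK
    flipped = []
    for dr, dc in DIRECTIONS:
        run = []
        r, c = row + dr, col + dc
        # Walk over a contiguous run of opponent discs.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            run.append((r, c))
            r, c = r + dr, c + dc
        # The run only counts if it is closed off by one of the player's own discs.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(run)
    return flipped

def legal_moves(board, player):
    """A move is legal only if it flips at least one opponent disc."""
    return [(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)]

print(legal_moves(initial_board(), BLACK))
```

The point of the sketch is that legality cannot be read off the move string directly: `flips` needs the current color of every tile along each line, which is exactly the information Othello-GPT is never given.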
Step 1: collect 20 million Othello games (random legal play) as training sequences.
Step 2: train an 8-layer, 512-dim GPT with causal masking to predict the next move token given the sequence so far.
Step 3: evaluate on held-out games — does the model predict legal moves?
Step 4: train a “probe”, a small classifier whose input is the model’s internal activations at layer L and whose output is the board state (for each of the 64 tiles, including the 4 center tiles that are never played but do change color: is it black, white, or empty?). If the probe achieves high accuracy, the board state is decodable from the activations.
Step 5: run intervention experiments — modify activations to match a counterfactual board state, observe whether the model’s move predictions change accordingly. If they do, the representation is causal, not just correlational.
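Steps 1 and 2 can be sketched as a tokenizer plus a next-token dataset builder. This is an illustration only: the tile naming (column letter, then row number), the choice of D4, D5, E4, E5 as the center tiles, and the helper names are assumptions, not the paper's code.

```python
COLUMNS = "ABCDEFGH"
CENTER = {"D4", "D5", "E4", "E5"}  # assumed names of the 4 pre-filled center tiles

# All 64 tiles, then the 60 playable ones that form the vocabulary.
TILES = [f"{col}{row}" for col in COLUMNS for row in range(1, 9)]
PLAYABLE = [tile for tile in TILES if tile not in CENTER]
TOKEN_ID = {tile: i for i, tile in enumerate(PLAYABLE)}

def encode(moves):
    """Map a game transcript (list of tile names) to token ids."""
    return [TOKEN_ID[m] for m in moves]

def decode(ids):
    """Map token ids back to tile names."""
    return [PLAYABLE[i] for i in ids]

def next_token_pairs(ids):
    """(prefix, target) pairs: the model sees the prefix and must predict the target."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

game = ["F5", "D6", "C3"]
ids = encode(game)
for prefix, target in next_token_pairs(ids):
    print(decode(prefix), "->", PLAYABLE[target])
```

Nothing in this pipeline mentions discs, colors, or flipping. Whatever board knowledge the model ends up with has to be inferred from the token sequences alone.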
The ASCII picture.
TRAINING: Othello-GPT sees only move tokens
─────────────────────────────────────────────────────────────────────
Input: [F5, D6, C3, D3, C4, ...] ← tile indices only
no board, no rules, no coordinates
|
v
8-layer GPT (causal masking)
|
v
Output: probability distribution over next move
(model learns to give high prob to legal moves)
Error rate: 0.01% ← near-perfect legality
PROBING: Is the board state in the activations?
─────────────────────────────────────────────────────────────────────
Othello-GPT
┌─────────────────────────────────┐
tokens │ layer 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8 │ → next move
└─────────────────────────────────┘
↑
x_t^6 (512-dim vector)
│
LINEAR PROBE: error 21.9% (barely better than a probe on a randomly initialized network)
NONLINEAR PROBE (2-layer MLP): error 1.7% (board state almost perfectly decoded from activations!)
INTERVENTION: Is the representation causal?
─────────────────────────────────────────────────────────────────────
Real board B → model predicts moves legal for B
↓
gradient-descent on activations
until probe reports B' (one tile flipped)
↓
Counterfactual B' → model now predicts moves legal for B' ✓
The math.
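The probe architecture:
$$p_\theta(x_t^l) = \mathrm{softmax}\left(W_1\,\mathrm{ReLU}\left(W_2\, x_t^l\right)\right)$$
Where $x_t^l$ is the activation at layer $l$ for token $t$, and the output is a 3-class distribution per tile (black / white / empty).
The intervention step modifies activations via gradient descent, with step size $\alpha$, on the probe’s cross-entropy loss against the desired board state $B'$:
$$x' \leftarrow x - \alpha\,\frac{\partial \mathcal{L}_{\mathrm{CE}}\left(p_\theta(x),\, B'\right)}{\partial x}$$
Translation: we don’t change the model weights. We change the activations until they look like what the probe would expect for board state $B'$. Then we resume forward computation from that modified activation. The model “thinks” it’s at $B'$ now.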
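The update described above (gradient descent on the probe's cross-entropy loss with respect to the activation, with all weights frozen) can be sketched for a single tile. Everything concrete here is an assumption for illustration: the 2-dimensional "activation", the hand-picked weights `W1` and `W2`, the step size, and the stopping rule. The real experiment works on 512-dimensional activations with trained probes.

```python
import math

BLACK, WHITE, EMPTY = 0, 1, 2

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def probe(W1, W2, x):
    """Two-layer probe softmax(W1 ReLU(W2 x)); also returns pre-activations."""
    pre = matvec(W2, x)
    hidden = [max(0.0, p) for p in pre]
    return softmax(matvec(W1, hidden)), pre

def grad_wrt_activation(W1, W2, x, target):
    """Gradient of the cross-entropy loss with respect to x (weights untouched)."""
    probs, pre = probe(W1, W2, x)
    dlogits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    dhidden = [sum(W1[k][j] * dlogits[k] for k in range(len(W1))) for j in range(len(pre))]
    dpre = [g if p > 0 else 0.0 for g, p in zip(dhidden, pre)]  # ReLU gate
    return [sum(W2[j][i] * dpre[j] for j in range(len(W2))) for i in range(len(x))]

def intervene(W1, W2, x, target, alpha=0.1, max_steps=100):
    """Repeat x <- x - alpha * dL/dx until the probe reports the target class."""
    x = list(x)
    for _ in range(max_steps):
        probs, _ = probe(W1, W2, x)
        if probs.index(max(probs)) == target:
            break
        grad = grad_wrt_activation(W1, W2, x, target)
        x = [xi - alpha * gi for xi, gi in zip(x, grad)]
    return x

# Toy probe for one tile (hypothetical weights, not trained).
W2 = [[1.0, 1.0], [-1.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [-1.0, 2.0]                      # probe initially reads "white"
x_edited = intervene(W1, W2, x, BLACK)
print(probe(W1, W2, x)[0], probe(W1, W2, x_edited)[0])
```

Note that only `x` moves. In the full experiment the edited activation is then fed back into the remaining layers, and the question is whether the model's move predictions follow.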
Walkthrough with real numbers.
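Let’s trace what happens when the probe extracts board state from layer 6 activations. The per-tile probabilities below are illustrative; the aggregate error rates are the ones reported in the paper.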
After 12 moves of a synthetic game:
Tile D5: black disc
Tile E5: white disc (just flipped by move 12)
Tile F4: empty
... (and so on for the remaining tiles)
Layer 6 activation x_t^6: 512-dimensional vector
(we don't see this directly — it's the probe's input)
Linear probe prediction for tile E5:
softmax(W · x_t^6)[white] = 0.41
softmax(W · x_t^6)[black] = 0.35
softmax(W · x_t^6)[empty] = 0.24
→ predicts white (correct), but barely (41% confidence)
→ Linear probe error across all tiles: 21.9%
Nonlinear probe prediction for tile E5:
softmax(W₁ · ReLU(W₂ · x_t^6))[white] = 0.97
→ predicts white with 97% confidence
→ Nonlinear probe error across all tiles: 1.7%
Intervention — flip E5 from white to black:
Run gradient descent on x_t^6 (and subsequent layers)
until probe reports E5 = black
Compute legal moves for the counterfactual board...
Result: model's top predictions shift to the moves that are legal
with E5 black, and drop moves that were legal only
because E5 was white.
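One plausible way to score the last step of the walkthrough is to compare the model's top-N predicted moves against the set of moves that are legal on the counterfactual board, with N equal to the size of that set. The sketch below assumes this scoring scheme and uses made-up probabilities and tile names; `intervention_errors` is a hypothetical helper, not the paper's code.

```python
def intervention_errors(move_probs, legal_after):
    """Count false positives plus false negatives among the top-N predictions.

    move_probs: dict mapping tile name -> predicted probability after intervention.
    legal_after: set of tiles that are legal on the counterfactual board B'.
    """
    n = len(legal_after)
    ranked = sorted(move_probs, key=move_probs.get, reverse=True)
    top_n = set(ranked[:n])
    false_positives = top_n - legal_after   # predicted but not legal on B'
    false_negatives = legal_after - top_n   # legal on B' but not predicted
    return len(false_positives) + len(false_negatives)

# Made-up numbers for illustration only.
probs = {"C3": 0.4, "D6": 0.3, "F5": 0.2, "A1": 0.1}
print(intervention_errors(probs, {"C3", "D6"}))  # predictions match B'
print(intervention_errors(probs, {"C3", "F5"}))  # one wrong move in, one legal move out
```

A score of zero means the post-intervention predictions agree exactly with the counterfactual board. Comparing against the same score with no intervention applied gives a baseline for how much the edit actually changed.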
What’s clever — find the instinct.
The nonlinearity finding is the surprise. Probing work usually starts with linear probes, on the assumption that information a model actually uses is stored in a form a linear read-out can recover. Finding that only nonlinear probes work suggests the board representation is there but “twisted”: not laid out along a simple linear subspace of the 512-dimensional activation space, at least not for the black / white / empty labels the probe was trained on.
“Linear probes never achieve error rates below 20%, barely outperforming probes trained on a randomly initialized network. This result suggests that if there is an internal representation of the board state, it does not have a simple linear form.”
This seems like a dead end. But then the nonlinear probe drops error to 1.7%. The information is there — it’s just not accessible to a linear read-out. The intervention experiment then shows it’s causal:
“We influence internal activations during Othello-GPT’s calculation and measure the resulting effects… we observe that the model’s predicted move distribution shifts to match the counterfactual board state.”
The non-obvious instinct: the model had little choice but to develop this representation. Which moves are legal is a function of the board state, and the board state is a nonlinear function of the move sequence. You cannot reliably answer “is E6 legal?” without tracking “what’s on E6 right now?” through all the flipping that has happened. Pure surface statistics are not enough, and the skewed-dataset experiment supports this:
“Since Othello-GPT has seen none of these test sequences before, pure sequence memorization cannot explain its performance.”
(Remove all C5 openings from training — 25% of the game tree. Error rate on novel games: 0.02%. The model generalized to game positions it could never have memorized.)
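The skewed-dataset setup amounts to partitioning games by their first move. A minimal sketch follows; the function name and the decision to return the held-out games as a separate list are assumptions, and which set the error rate is then measured on is a design choice this sketch does not settle.

```python
def split_by_opening(games, held_out_opening="C5"):
    """Separate games that start with a given opening from all the others.

    games: list of transcripts, each a list of tile names.
    Returns (training_games, held_out_games).
    """
    training, held_out = [], []
    for game in games:
        if not game:
            continue  # skip empty transcripts
        (held_out if game[0] == held_out_opening else training).append(game)
    return training, held_out

# Tiny made-up transcripts, for illustration only.
games = [["C5", "C6"], ["D3", "C3"], ["C5", "E6"], ["F4", "F3"]]
training, held_out = split_by_opening(games)
print(len(training), len(held_out))
```

Every position reachable from the held-out opening is then unseen during training, so good legality on those positions cannot come from having memorized the sequences.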
The follow-up twist (not in this paper).
A 2023 follow-up by Nanda et al. ran a different probe and found the representation is actually linear, given the right labels. The original paper’s linear probes tried to read out black / white / empty; Nanda et al. found that the board state reads out linearly when tiles are labeled relative to the player about to move (“mine” / “theirs” / empty). This suggests the information is linearly encoded, just not along the axis that naive probing looks for. The debate about linear vs. nonlinear world models in transformers continues.
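The relabeling behind the follow-up result is easy to state in code. The sketch below converts absolute colors into labels relative to the player about to move; the board encoding and the function name are assumptions for illustration.

```python
def to_player_relative(board, player_to_move):
    """Relabel a board from black/white to mine/theirs for the player about to move.

    board: list of rows, each cell one of "B", "W", or "." (empty).
    player_to_move: "B" or "W".
    """
    def relabel(cell):
        if cell == ".":
            return "empty"
        return "mine" if cell == player_to_move else "theirs"
    return [[relabel(cell) for cell in row] for row in board]

row = [["B", "W", "."]]
print(to_player_relative(row, "B"))
print(to_player_relative(row, "W"))
```

The same physical board gets opposite labels depending on whose turn it is. A probe trained on absolute colors therefore has to combine "which side owns this disc" with "whose turn is it", an interaction a single linear read-out handles poorly, while a probe trained on relative labels does not.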
Does it actually work? What breaks?
| Setting | Metric | Value | vs. Baseline |
|---|---|---|---|
| Synthetic-trained | Legal move error | 0.01% | vs. 93.29% untrained (random chance) |
| Championship-trained | Legal move error | 5.17% | still far better than random |
| Nonlinear probe (layer 6) | Board state error | 1.7% | vs. 21.9% linear, 25.5% random network |
| Intervention (flip 1 tile) | Prediction shifts correctly | ~60–70% | vs. 0% if representation were non-causal |
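For reference, the two error metrics in the table can be computed along these lines. This is a sketch under assumptions: legality is scored on the model's single top-ranked move per position, board-state error is a per-tile mismatch rate, and both function names are hypothetical.

```python
def legal_move_error(top1_moves, legal_sets):
    """Fraction of positions where the top-ranked predicted move is not legal."""
    if not top1_moves:
        raise ValueError("need at least one prediction")
    illegal = sum(1 for move, legal in zip(top1_moves, legal_sets) if move not in legal)
    return illegal / len(top1_moves)

def board_state_error(predicted_tiles, true_tiles):
    """Fraction of tiles where the probe's predicted state differs from the truth."""
    if not true_tiles:
        raise ValueError("need at least one tile")
    wrong = sum(1 for p, t in zip(predicted_tiles, true_tiles) if p != t)
    return wrong / len(true_tiles)

# Made-up predictions, for illustration only.
moves = ["C4", "A1", "F5", "D3"]
legal = [{"C4", "D3"}, {"C4"}, {"F5"}, {"D3", "E6"}]
print(legal_move_error(moves, legal))
print(board_state_error(list("BW.."), list("BB..")))
```

Both are plain mismatch rates, which is why the untrained baseline for legality sits near the fraction of tiles that are illegal in a typical position.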
What doesn’t work: the intervention success rate (~60–70%) is real but not perfect. The gradient descent modification to activations is a blunt instrument — it pushes one tile’s probe score but creates noise in other tiles, degrading the model’s internal consistency. Causal control is partial, not surgical.
The nonlinear probe also limits interpretability: you can extract the board state, but you can’t easily reason about where specific information lives or why particular tiles are easier to decode at certain layers. Mechanistic interpretability tools that work on linear circuits (like those developed by Anthropic’s interpretability team) don’t apply directly.
There’s also a gap between “the model has a world model” and “the model uses the world model the way a human would.” The probe-extracted board state is a side-effect of the computation — but the actual forward pass might be doing something stranger. Intervention success is evidence of a causal link, not proof of human-like board simulation.
So what?
If you’re building ML systems where internal representations matter — interpretability, oversight, reliability on novel inputs — this paper has a direct lesson: sequence models can develop genuine world models, but those models may be nonlinearly encoded and invisible to standard linear probes. If your linear probe says “nothing here,” try a nonlinear probe before concluding the information is absent.
For LLM oversight research: the intervention technique is a prototype for “activation steering” — the idea that you can modify a model’s internal representations at runtime to change its behavior. This became a whole research program (the SAE / representation engineering work that followed). The gap between “read out the world model” and “surgically modify it” is still the frontier.
The deeper connection: emergent-behavior looked at whether capabilities emerge sharply at scale. Othello-GPT shows something complementary: representations can emerge from training on sequences, even when nothing in the objective asks for them. The model was never shown a board; it learned a board model as a side effect of learning to predict. This is evidence that neural networks do more than collect surface statistics, but it’s a different kind of evidence than scaling curves.
Remember how emergent-abilities-of-large-language-models showed sharp capability jumps at scale? Othello-GPT is silent on that question — the game is simple enough that even a moderate model learns it. What it answers is the prior question: does a sequence model, in principle, build internal structure? The answer here is yes — and that structure is causal, not decorative.
The one-sentence version: a GPT trained only on Othello move sequences develops an internal model of the board state, decodable by a nonlinear probe and editable to change the model’s predictions, which is strong evidence that sequence models can build genuine world models rather than just memorize surface patterns.
Connections
- mechanistic-interpretability — introduces probing and intervention as core interpretability tools
- probing — the primary methodology used in this paper
- emergent-behavior — evidence that world models emerge without explicit supervision
- transformer — the GPT architecture whose internals are being probed
- in-context-learning — related question: does ICL also involve emergent internal structure?
Citation
Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. ICLR 2023. https://arxiv.org/abs/2210.13382