What It Is
Probing is an interpretability technique where you train a small classifier (a “probe”) on top of a neural network’s internal activations to test whether some concept of interest — board state, syntactic structure, sentiment — is encoded in those activations.
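If the probe achieves high accuracy on held-out data, that is evidence the concept can be decoded from the representation. If accuracy stays near chance, the concept is not decodable by that family of probes, though a more expressive probe might still recover it.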
Why It Matters
You can’t look at 512-dimensional activation vectors and understand what they mean. Probing gives you a hypothesis-driven way to check: “does this layer know X?” — without reverse-engineering the whole computation.
How It Works
- Pick a concept you want to test (e.g., “is tile E6 white or black in the current Othello game?”)
- Collect pairs of (internal activation, ground-truth label) from many examples
- Train a small classifier (linear or MLP) from activations → label
- Evaluate accuracy on held-out examples
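The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the setup from any particular paper: the activations are synthetic (a binary concept shifts each vector along one hidden direction), and the names `train_linear_probe` and `probe_accuracy` are hypothetical helpers. With a real model you would replace `acts` and `labels` with activations captured from a chosen layer and the matching ground-truth labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1-2: (activation, label) pairs. Synthetic stand-in for real activations.
n, d = 2000, 64
labels = rng.integers(0, 2, size=n)              # ground-truth concept (0 or 1)
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
# Assumption: the concept shifts each activation along one hidden direction.
acts = rng.normal(size=(n, d)) + 2.0 * (2 * labels[:, None] - 1) * concept_dir

# Keep 20% of the examples held out for evaluation.
split = int(0.8 * n)
X_train, y_train = acts[:split], labels[:split]
X_test, y_test = acts[split:], labels[split:]


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Step 3: logistic regression fitted with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(label = 1)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, X, y):
    """Step 4: fraction of examples the probe classifies correctly."""
    return float(((X @ w + b > 0).astype(int) == y).mean())


w, b = train_linear_probe(X_train, y_train)
test_acc = probe_accuracy(w, b, X_test, y_test)
chance = float(max(y_test.mean(), 1 - y_test.mean()))  # majority-class baseline
print(f"held-out accuracy: {test_acc:.3f} (majority-class baseline: {chance:.3f})")
```

Always compare the held-out accuracy against a baseline such as the majority-class rate; a probe that scores 90% on a concept that is true 90% of the time has learned nothing.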
Linear probes vs. nonlinear probes: A linear probe separates classes with a hyperplane in activation space, so it succeeds only when the concept is linearly decodable. If only a nonlinear probe (such as a small MLP) succeeds, the concept is encoded but not in a linearly separable form; it is “twisted” in activation space. This is what emergent-world-representations-othello-gpt found for Othello board states.
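To make the linear vs. nonlinear distinction concrete, here is a toy sketch in which the concept is the XOR of two factors stored in separate activation dimensions. The data, the probe sizes, and the helper names are all assumptions chosen for illustration; they do not reproduce the Othello experiments. A single hyperplane can get at most three of the four XOR clusters right, so the linear probe should land well short of the MLP probe.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8

# Two hidden binary factors; the probed concept is their XOR.
s0 = rng.integers(0, 2, size=n)
s1 = rng.integers(0, 2, size=n)
labels = s0 ^ s1

# Toy activations: small noise everywhere, factors written into dims 0 and 1.
acts = 0.3 * rng.normal(size=(n, d))
acts[:, 0] += 2 * s0 - 1
acts[:, 1] += 2 * s1 - 1

split = int(0.8 * n)
X_train, y_train = acts[:split], labels[:split]
X_test, y_test = acts[split:], labels[split:]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_linear_probe(X, y, lr=0.2, steps=2000):
    """Logistic regression: a single hyperplane in activation space."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        g = (sigmoid(X @ w + b) - y) / len(y)
        w -= lr * (X.T @ g)
        b -= lr * g.sum()
    return lambda A: (A @ w + b > 0).astype(int)


def train_mlp_probe(X, y, hidden=32, lr=0.2, steps=3000, seed=1):
    """One-hidden-layer tanh MLP trained with full-batch gradient descent."""
    init = np.random.default_rng(seed)
    W1 = init.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = init.normal(scale=0.5, size=hidden)
    w2 = init.normal(scale=0.1, size=hidden)
    b2 = 0.0
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        g = (sigmoid(h @ w2 + b2) - y) / len(y)   # gradient w.r.t. output logits
        gh = np.outer(g, w2) * (1.0 - h ** 2)     # backprop through tanh
        w2 -= lr * (h.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gh)
        b1 -= lr * gh.sum(axis=0)
    return lambda A: (np.tanh(A @ W1 + b1) @ w2 + b2 > 0).astype(int)


linear_probe = train_linear_probe(X_train, y_train)
mlp_probe = train_mlp_probe(X_train, y_train)

linear_acc = float((linear_probe(X_test) == y_test).mean())
mlp_acc = float((mlp_probe(X_test) == y_test).mean())
print(f"linear probe: {linear_acc:.3f}   MLP probe: {mlp_acc:.3f}")
```

A gap like this tells you the information is there but not along a single direction. Keep the nonlinear probe small: the more capacity it has, the harder it is to tell whether the network or the probe is doing the work.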
Caveat: Probing shows that information is present, not that it’s used. A probe may recover a concept that the model computed but doesn’t consult for its actual predictions. Intervention experiments (patching activations) are needed to establish causal relevance.
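The gap between “present” and “used” can be shown with a deliberately artificial example. In the sketch below, the toy “model” is a hypothetical linear readout that only looks at direction `u`, while the concept is written along an orthogonal direction `v`. A mean-difference probe decodes the concept almost perfectly, yet patching activations along the probe direction leaves the output nearly unchanged. In a real network you would patch activations at a layer and rerun the forward pass; `toy_model_output` and `patch_along` are stand-ins for that.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16

u = np.eye(d)[0]   # direction the toy model's output depends on
v = np.eye(d)[1]   # direction that encodes the probed concept

concept = rng.integers(0, 2, size=n)
used = rng.normal(size=n)
acts = (0.1 * rng.normal(size=(n, d))
        + np.outer(used, u)
        + np.outer(2 * concept - 1, v))


def toy_model_output(a):
    """Hypothetical downstream computation: it reads only direction u."""
    return a @ u


# Fit a mean-difference linear probe on the first 800 examples.
fit, held = slice(0, 800), slice(800, n)
mu1 = acts[fit][concept[fit] == 1].mean(axis=0)
mu0 = acts[fit][concept[fit] == 0].mean(axis=0)
probe_dir = (mu1 - mu0) / np.linalg.norm(mu1 - mu0)
threshold = 0.5 * (mu1 + mu0) @ probe_dir


def probe_predict(a):
    return (a @ probe_dir > threshold).astype(int)


probe_acc = float((probe_predict(acts[held]) == concept[held]).mean())


def patch_along(a, donor, direction):
    """Replace each row's component along a unit `direction` with the donor row's."""
    return a + np.outer((donor - a) @ direction, direction)


# Pair every example with a random donor example.
donor_idx = rng.permutation(n)
donors = acts[donor_idx]
patched_probe = patch_along(acts, donors, probe_dir)  # intervene on the probed concept
patched_used = patch_along(acts, donors, u)           # intervene on what the model reads

base = toy_model_output(acts)
follows_donor = float((probe_predict(patched_probe) == concept[donor_idx]).mean())
effect_probe = float(np.abs(toy_model_output(patched_probe) - base).mean())
effect_used = float(np.abs(toy_model_output(patched_used) - base).mean())

print(f"probe accuracy (held-out):            {probe_acc:.3f}")
print(f"probe reads donor concept after patch: {follows_donor:.3f}")
print(f"mean output change, probe direction:   {effect_probe:.3f}")
print(f"mean output change, used direction:    {effect_used:.3f}")
```

Here the patch clearly rewrites what the probe reads, but the output barely moves, so the concept is decodable without being causally relevant. If the output had shifted in the way the patched concept predicts, that would be evidence the model actually consults it.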
Key Sources
- emergent-world-representations-othello-gpt — the canonical example of probing a GPT model for board state, finding a nonlinear world model