Probing (Neural Network Interpretability)

What It Is

Probing is an interpretability technique where you train a small classifier (a “probe”) on top of a neural network’s internal activations to test whether some concept of interest — board state, syntactic structure, sentiment — is encoded in those activations.

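If the probe achieves high accuracy on held-out examples, the concept is decodable from the representation. If accuracy stays at chance, this probe could not recover the concept, which suggests (but does not prove) that it is not encoded at that layer in a form the probe can read.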

Why It Matters

You can’t look at 512-dimensional activation vectors and understand what they mean. Probing gives you a hypothesis-driven way to check: “does this layer know X?” — without reverse-engineering the whole computation.

How It Works

1. Pick a concept you want to test (e.g., “is tile E6 white or black in the current Othello game?”)
2. Collect pairs of (internal activation, ground-truth label) from many examples
3. Train a small classifier (linear or MLP) from activations → label
4. Evaluate accuracy on held-out examples
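
The four steps above can be sketched end to end with NumPy. This is a minimal illustration, not anyone's published setup: the "activations" are synthetic vectors with a binary concept written along one hypothetical direction (`concept_direction`), standing in for activations you would extract from a real model. The function names and hyperparameters are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: a binary concept and synthetic (activation, label) pairs.
# ASSUMPTION: real activations would come from a model's hidden layer;
# here the concept is planted along one random direction plus noise.
n_examples, n_dims = 2000, 64
labels = rng.integers(0, 2, size=n_examples)
concept_direction = rng.normal(size=n_dims)
concept_direction /= np.linalg.norm(concept_direction)
activations = rng.normal(size=(n_examples, n_dims))
activations += 3.0 * np.outer(2 * labels - 1, concept_direction)

# Hold out 20% of the pairs for evaluation.
split = int(0.8 * n_examples)
X_train, y_train = activations[:split], labels[:split]
X_test, y_test = activations[split:], labels[split:]


def train_linear_probe(X, y, lr=0.1, epochs=300):
    """Step 3: logistic-regression probe fit by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = probs - y                  # gradient of the loss w.r.t. the logits
        w -= lr * X.T @ error / len(y)
        b -= lr * error.mean()
    return w, b


def probe_accuracy(w, b, X, y):
    """Step 4: fraction of examples the probe classifies correctly."""
    return float(((X @ w + b > 0).astype(int) == y).mean())


w, b = train_linear_probe(X_train, y_train)
test_accuracy = probe_accuracy(w, b, X_test, y_test)
majority_baseline = float(max(y_test.mean(), 1 - y_test.mean()))
print(f"probe accuracy: {test_accuracy:.3f}  majority baseline: {majority_baseline:.3f}")
```

Comparing against the majority-class baseline matters because "chance" is only 50% when the labels are balanced.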

Linear probes vs. nonlinear probes: A linear probe fits a hyperplane through the activation space, so it succeeds only when the concept is linearly separable. If a linear probe fails where a nonlinear probe (such as a small MLP) succeeds, the concept is encoded but not in a linearly separable form. This is what emergent-world-representations-othello-gpt reported for Othello board states.
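
To make the linear vs. nonlinear distinction concrete, here is a small sketch on synthetic data. The concept is an XOR of two hidden binary factors, which no single hyperplane can separate. Everything here is an assumption made for illustration (the data, the probe sizes, the helper names `fit_linear_probe` and `fit_mlp_probe`); it does not reproduce the Othello experiments.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# ASSUMPTION: synthetic "activations" with four tight clusters in the first
# two coordinates; the label is the XOR-style product of the two signs.
n_examples, n_dims = 2000, 8
signs = rng.choice([-1.0, 1.0], size=(n_examples, 2))
labels = (signs[:, 0] * signs[:, 1] > 0).astype(float)
activations = 0.3 * rng.normal(size=(n_examples, n_dims))
activations[:, :2] += 2.0 * signs

split = int(0.8 * n_examples)
X_train, y_train = activations[:split], labels[:split]
X_test, y_test = activations[split:], labels[split:]


def fit_linear_probe(X, y, lr=0.1, epochs=2000):
    """Logistic regression: a single hyperplane in activation space."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y
        w -= lr * X.T @ err / len(y)
        b -= lr * err.mean()
    return lambda X_new: (X_new @ w + b > 0).astype(float)


def fit_mlp_probe(X, y, hidden=32, lr=0.1, epochs=3000):
    """One-hidden-layer ReLU probe trained with full-batch gradient descent."""
    W1 = rng.normal(scale=1 / np.sqrt(X.shape[1]), size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=1 / np.sqrt(hidden), size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        pre = X @ W1 + b1
        h = np.maximum(pre, 0.0)
        err = (sigmoid(h @ w2 + b2) - y) / len(y)   # d(mean loss)/d(logit)
        grad_pre = np.outer(err, w2) * (pre > 0)    # backprop through the ReLU
        grad_w2 = h.T @ err
        W1 -= lr * X.T @ grad_pre
        b1 -= lr * grad_pre.sum(axis=0)
        w2 -= lr * grad_w2
        b2 -= lr * err.sum()
    return lambda X_new: (np.maximum(X_new @ W1 + b1, 0.0) @ w2 + b2 > 0).astype(float)


linear_accuracy = float((fit_linear_probe(X_train, y_train)(X_test) == y_test).mean())
mlp_accuracy = float((fit_mlp_probe(X_train, y_train)(X_test) == y_test).mean())
print(f"linear probe: {linear_accuracy:.3f}   MLP probe: {mlp_accuracy:.3f}")
```

On this toy data a hyperplane can put at most about three of the four clusters on the correct side, so the linear probe stays well below the MLP probe even though the concept is fully present in the activations.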

Caveat: Probing shows that information is present, not that it’s used. A probe may recover a concept that the model computed but doesn’t consult for its actual predictions. Intervention experiments (patching activations) are needed to establish causal relevance.
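
The gap between "present" and "used" can be shown with a toy example. The sketch below assumes a made-up hidden layer that stores two independent binary concepts along orthogonal directions, and a toy output head (`toy_output`) that reads only one of them. Both concepts are decodable by a least-squares linear probe, but only one affects the output when intervened on. The intervention used here, reflecting the activation along the probe direction, is a simplified stand-in for activation patching, and every name in the code is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_dims = 1000, 16

# ASSUMPTION: two orthonormal directions in a hypothetical hidden layer.
basis, _ = np.linalg.qr(rng.normal(size=(n_dims, 2)))
used_dir, unused_dir = basis[:, 0], basis[:, 1]

# Two independent concepts in {-1, +1}; both are written into the hidden state.
used = rng.choice([-1.0, 1.0], size=n_examples)
unused = rng.choice([-1.0, 1.0], size=n_examples)
hidden = (np.outer(used, used_dir) + np.outer(unused, unused_dir)
          + 0.1 * rng.normal(size=(n_examples, n_dims)))


def toy_output(h):
    """Toy model head: its prediction depends only on the `used_dir` component."""
    return np.sign(h @ used_dir)


def fit_probe(h, y):
    """Linear probe fit by least squares onto the +/-1 labels."""
    w, *_ = np.linalg.lstsq(h, y, rcond=None)
    return w


def flip_along(h, w):
    """Intervention: reflect each activation across the probe's hyperplane."""
    unit = w / np.linalg.norm(w)
    return h - 2.0 * np.outer(h @ unit, unit)


split = int(0.8 * n_examples)
h_train, h_test = hidden[:split], hidden[split:]
baseline = toy_output(h_test)

results = {}
for name, concept in [("used", used), ("unused", unused)]:
    w = fit_probe(h_train, concept[:split])
    accuracy = float((np.sign(h_test @ w) == concept[split:]).mean())
    changed = float((toy_output(flip_along(h_test, w)) != baseline).mean())
    results[name] = (accuracy, changed)
    print(f"{name:>6}: probe accuracy {accuracy:.3f}, outputs changed {changed:.3f}")
```

Probe accuracy alone cannot tell the two concepts apart; only the intervention reveals which one the output depends on.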

Key Sources

emergent-world-representations-othello-gpt — the canonical example of probing a GPT model for board state, finding a nonlinear world model
