What It Is
Mechanistic interpretability is the study of what neural network weights actually compute — not just what the network does at the input/output level, but which specific circuits, features, and representations inside the model implement each behavior.
Why It Matters
Black-box evaluation tells you that a model performs well on a benchmark. Mechanistic interpretability tells you why: which parts of the model are responsible, how information flows, and whether the model has learned a robust, generalizing algorithm or a brittle shortcut. This kind of understanding is a foundation for trustworthy AI oversight.
How It Works
The core toolkit:
- Probing: train a classifier on internal activations to test whether a concept (board state, grammar, sentiment) can be decoded from them.
- Activation patching / causal tracing: replace the activations at a specific layer and position with activations cached from a different input, then measure how the output changes. This identifies which internal computations causally affect the behavior.
- Circuit analysis: trace information flow through specific attention heads and MLP layers to reconstruct the algorithm the model is using.
- Sparse autoencoders (SAEs): decompose superposed activations (for example, MLP outputs or the residual stream) into a larger set of sparsely active, more interpretable features.
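To make activation patching from the toolkit above concrete, here is a minimal sketch on a hand-built toy network. Everything in it is an assumption made for illustration: the weights `W_IN` and `W_OUT`, the helper names `hidden`, `readout`, `run`, and `patching_effects`, and the two example inputs. It is not how any particular library or paper implements the technique; real work patches activations inside a trained Transformer, usually through framework hooks.

```python
def relu(v):
    return max(0.0, v)

# Hypothetical toy model: 2 inputs -> 3 hidden ReLU units -> 1 output.
W_IN = [
    [1.0, 0.0],  # hidden unit 0 reads input 0
    [0.0, 1.0],  # hidden unit 1 reads input 1
    [1.0, 1.0],  # hidden unit 2 reads both inputs
]
W_OUT = [2.0, 0.0, 0.5]  # unit 1 is computed but ignored by the output


def hidden(x):
    """Hidden-layer activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def readout(h):
    """Linear readout from hidden activations to a scalar output."""
    return sum(w * hi for w, hi in zip(W_OUT, h))


def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to force."""
    h = hidden(x)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return readout(h)


def patching_effects(clean_x, corrupted_x):
    """Fraction of the clean-vs-corrupted output gap restored by each unit.

    Assumes the clean and corrupted outputs differ (nonzero gap).
    """
    clean_h = hidden(clean_x)          # cache activations from the clean run
    clean_out = run(clean_x)
    corrupted_out = run(corrupted_x)
    gap = clean_out - corrupted_out
    effects = []
    for idx, value in enumerate(clean_h):
        # Rerun the corrupted input with one unit overwritten by its clean value.
        patched_out = run(corrupted_x, patch={idx: value})
        effects.append((patched_out - corrupted_out) / gap)
    return effects


clean = [1.0, 3.0]
corrupted = [0.0, 1.0]
effects = patching_effects(clean, corrupted)
for idx, effect in enumerate(effects):
    print(f"hidden unit {idx}: restores {effect:.0%} of the output gap")
```

In this toy, patching hidden unit 1 restores none of the gap even though its activation tracks input 1 exactly. That is the distinction between probing and patching: a probe could decode input 1 from that unit, but patching shows the output never uses it.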
Key Sources
- emergent-world-representations-othello-gpt: probing + intervention on Othello-GPT; shows that a nonlinear internal representation of the board state emerges from training on move sequences alone
- attention-is-all-you-need: the Transformer architecture that most mechanistic interpretability work studies