What It Is

Mechanistic interpretability is the study of what neural network weights actually compute — not just what the network does at the input/output level, but which specific circuits, features, and representations inside the model implement each behavior.

Why It Matters

Black-box evaluation tells you that a model performs well on a benchmark. Mechanistic interpretability aims to tell you why: which parts of the model are responsible, how information flows between them, and whether the model has learned a robust, generalizing algorithm or a brittle shortcut. That kind of evidence is a foundation for trustworthy AI oversight, since a brittle shortcut can score well on a benchmark and still fail on inputs the benchmark does not cover.

How It Works

The core toolkit:

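  • Probing: train a simple classifier (often linear) on internal activations to test whether a concept (board state, grammar, sentiment) can be decoded there. A successful probe shows that the information is present, not that the model uses it.
  • Activation patching / causal tracing: run the model on a clean input and a corrupted input, replace the activations at a specific layer and position in one run with those from the other, and measure how much the output changes. Components whose patched activations shift the output are causally implicated in the behavior.
  • Circuit analysis: trace information flow through specific attention heads and MLP layers to reconstruct the algorithm the model implements.
  • Sparse autoencoders (SAEs): decompose superposed activations (from MLP layers, attention outputs, or the residual stream) into a larger set of sparsely active features that are often easier to interpret.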
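
Activation patching from the list above can be sketched on a toy network. This is a minimal illustration under stated assumptions, not how any particular interpretability library does it: the weights W1, B1, W2 and the helpers hidden, readout, and run_with_patch are hypothetical names made up for this example, and the "model" is a hand-set 2-input, 3-hidden-unit ReLU network rather than a trained transformer. Each hidden unit's clean activation is copied into the corrupted run in turn, and the score is the fraction of the clean-versus-corrupted output gap that the patch restores.

```python
# Hypothetical toy model: 2 inputs -> 3 hidden ReLU units -> 1 scalar output.
# All weights are hand-set for illustration; nothing here is a trained model.
W1 = [[1.0, 0.0, 0.0],   # weights from input 0 to hidden units 0..2
      [0.0, 1.0, 0.0]]   # weights from input 1 to hidden units 0..2
B1 = [0.0, 0.0, 1.0]     # hidden unit 2 is a constant: it ignores the input
W2 = [2.0, 0.0, 1.0]     # the output reads units 0 and 2, never unit 1


def hidden(x):
    """Hidden-layer activations (ReLU) for input vector x."""
    return [
        max(sum(x[i] * W1[i][j] for i in range(len(x))) + B1[j], 0.0)
        for j in range(len(B1))
    ]


def readout(h):
    """Scalar output computed from hidden activations h."""
    return sum(hj * wj for hj, wj in zip(h, W2))


def run_with_patch(x, unit=None, value=None):
    """Run the model on x, optionally overwriting one hidden unit."""
    h = hidden(x)
    if unit is not None:
        h[unit] = value  # the patch: swap in an activation from another run
    return readout(h)


clean = [1.0, 1.0]
corrupted = [0.0, 0.0]

clean_h = hidden(clean)                  # cached activations from the clean run
clean_out = readout(clean_h)
corrupt_out = run_with_patch(corrupted)  # baseline with no patch applied

# Patch one hidden unit at a time and record the fraction of the
# clean-vs-corrupted output gap that the patch restores.
effects = []
for unit in range(len(clean_h)):
    patched_out = run_with_patch(corrupted, unit, clean_h[unit])
    effects.append((patched_out - corrupt_out) / (clean_out - corrupt_out))

print(effects)  # [1.0, 0.0, 0.0]
```

Only unit 0 restores the clean output. Unit 1 takes different values on the two inputs, so a probe could decode the input from it, yet patching it changes nothing because the output never reads it. Unit 2 feeds the output but has the same value in both runs, so patching it has no effect for this particular pair of inputs. Real experiments apply the same loop over layers, token positions, or attention heads, usually with a metric such as a logit difference.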

Key Sources