Mechanistic Interpretability

What It Is

Mechanistic interpretability is the study of what a neural network's weights actually compute: not just what the network does at the input/output level, but which specific circuits, features, and representations inside the model implement each behavior.

Why It Matters

Black-box evaluation tells you that a model performs well on a benchmark. Mechanistic interpretability aims to tell you why: which parts of the model are responsible, how information flows, and whether the model has learned a robust, generalizing algorithm or a brittle shortcut. That understanding is a foundation for trustworthy AI oversight.

How It Works

The core toolkit:

Probing: train a simple classifier (often linear) on internal activations to test whether a concept (board state, grammar, sentiment) can be decoded from them. A successful probe shows that the information is present, not that the model actually uses it.
Activation patching / causal tracing: replace activations at a specific layer and position with activations from a different run (for example, a clean prompt versus a corrupted one), then measure how the output changes. This identifies which computations causally matter for a behavior.
Circuit analysis: trace information flow through specific attention heads and MLP layers to reconstruct the algorithm the model is using.
Sparse autoencoders (SAEs): decompose superposed activations, such as those in MLP layers, into a larger set of sparsely active features that tend to be more interpretable.
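
As a rough illustration of activation patching, the sketch below uses a hand-built two-layer ReLU network in pure Python. The weights, inputs, and function names (hidden, output, run, patching_effects) are all hypothetical and chosen for readability; real work patches activations inside a trained Transformer, usually through framework hooks. Each hidden unit's activation from a clean run is copied into a corrupted run, and the score is the fraction of the clean-versus-corrupted output gap that the patch restores.

```python
# Toy activation patching on a hand-built 2-layer ReLU network.
# Every weight and input here is an illustrative assumption, not a real model.

def relu(x):
    return max(0.0, x)

# Hidden layer: 3 units, each a weighted sum of the 2 inputs.
W1 = [[1.0, 0.0],   # unit 0 reads input 0
      [0.0, 1.0],   # unit 1 reads input 1
      [1.0, 1.0]]   # unit 2 reads both inputs
# Output layer: unit 1 has zero weight, so it cannot affect the output.
W2 = [2.0, 0.0, 1.0]

def hidden(x):
    """Hidden activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Scalar output from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> replacement activation."""
    h = hidden(x)
    for i, value in (patch or {}).items():
        h[i] = value
    return output(h)

def patching_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupt output gap restored by patching each unit."""
    clean_h = hidden(clean_x)
    clean_out = output(clean_h)
    corrupt_out = run(corrupt_x)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted outputs are identical")
    effects = []
    for i, clean_act in enumerate(clean_h):
        # Run on the corrupted input, but with unit i taken from the clean run.
        patched_out = run(corrupt_x, patch={i: clean_act})
        effects.append((patched_out - corrupt_out) / gap)
    return effects

# The corrupted input differs from the clean one only in input 0.
effects = patching_effects(clean_x=[1.0, 1.0], corrupt_x=[0.0, 1.0])
print([round(e, 3) for e in effects])  # [0.667, 0.0, 0.333]
```

Unit 1 scores zero because its output weight is zero, while units 0 and 2 together account for the whole gap. The scores sum to one here only because the toy output is linear in the hidden units; in a real network, effects from different sites generally do not add up so neatly.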

Key Sources

emergent-world-representations-othello-gpt: probing and intervention on Othello-GPT; shows that a nonlinear internal representation of the board state emerges from training on move sequences alone
attention-is-all-you-need: the Transformer architecture that most mechanistic interpretability work studies
grokking-generalization-beyond-overfitting
