What It Is

Grokking is the phenomenon in which a neural network first overfits its training data (memorization) and then, much later in training, abruptly reaches near-perfect generalization on held-out data. The term was coined by Power et al. (2022), borrowing Robert Heinlein’s word (from Stranger in a Strange Land) for deep, intuitive understanding.

Why It Matters

Grokking challenges the standard view that overfitting and generalization are mutually exclusive phases. It suggests that with sufficient optimization, models can transition from memorizing specific examples to learning the underlying algorithm. This has implications for understanding double descent, the role of weight decay, and why continued training on fine-tuned models can sometimes improve generalization.

How It Works

The mechanism is still debated, but current understanding (from mechanistic interpretability work) suggests:

  1. In the memorization phase, the model uses a “memorization circuit” — essentially a lookup table.
  2. A separate “generalization circuit” (implementing the true algorithm) forms gradually in parallel, even while training accuracy is already perfect.
  3. Weight decay penalizes the larger memorization circuit more than the compact generalization circuit, eventually causing the generalization circuit to dominate.
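
The weight-decay argument in step 3 can be illustrated with a toy parameter-count comparison (a hypothetical sketch with illustrative numbers, not an implementation from any paper): a lookup table for addition mod p needs on the order of p² entries, while a compact Fourier-style circuit needs only O(p) parameters, so weights of comparable scale incur a far larger L2 penalty in the table.

```python
import numpy as np

# Hypothetical illustration: compare the L2 penalty weight decay applies
# to a "memorization circuit" (one entry per (a, b) pair, ~p^2 params)
# versus a compact "generalization circuit" (a few frequencies per
# input, ~O(p) params). The modulus p and frequency count are assumed.
rng = np.random.default_rng(0)
p = 97          # modulus of the toy task
n_freqs = 8     # assumed number of key frequencies in the compact circuit

lookup = rng.normal(size=(p, p))             # ~p^2 parameters
fourier = rng.normal(size=(2 * n_freqs, p))  # ~O(p) parameters

l2_lookup = float(np.sum(lookup ** 2))
l2_fourier = float(np.sum(fourier ** 2))
print(f"lookup-table L2 penalty:    {l2_lookup:.0f}")
print(f"compact-circuit L2 penalty: {l2_fourier:.0f}")

# Weight decay shrinks both circuits, but at unit weight scale the
# memorization circuit pays roughly p / (2 * n_freqs) times more,
# so it is the one that eventually decays away.
assert l2_lookup > l2_fourier
```

The exact ratio depends on weight scales, but the asymmetry in parameter count is what makes the memorization circuit the more expensive solution under regularization.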

Key conditions for grokking: weight decay (or other regularization), sufficient training time, and a task with an underlying algorithmic structure.
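
These conditions come together in the canonical experimental setup from Power et al. (2022): an algorithmic task (here, addition mod a prime) with a random train/held-out split, trained far past the point of perfect training accuracy. The sketch below builds only the dataset; the split fraction and modulus are illustrative.

```python
import numpy as np

# Sketch of the canonical grokking dataset: addition modulo a prime,
# with roughly half the (a, b) pairs held out. Hyperparameters here
# are illustrative, not the paper's exact values.
p = 97                                        # prime modulus
pairs = [(a, b) for a in range(p) for b in range(p)]
labels = [(a + b) % p for a, b in pairs]

rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
split = len(pairs) // 2                       # ~50% train fraction
train_idx, test_idx = idx[:split], idx[split:]

X = np.array(pairs)
y = np.array(labels)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# A small model (e.g. a 1-layer transformer) trained on this split with
# a weight-decayed optimizer such as AdamW typically memorizes the
# training set quickly; test accuracy jumps only much later, often
# tens of thousands of steps after train accuracy saturates.
print(X_train.shape, X_test.shape)  # (4704, 2) (4705, 2)
```

The held-out pairs are never seen during training, so the late jump in test accuracy cannot come from memorization; it requires the model to have learned modular addition itself.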

Key Sources

  • Power et al. (2022), “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets” — coined the term and demonstrated the phenomenon on modular-arithmetic tasks.
  • Nanda et al. (2023), “Progress Measures for Grokking via Mechanistic Interpretability” — reverse-engineered the learned generalization circuit.

Open Questions

  • Does grokking occur in large-scale pretraining?
  • Can grokking explain sudden capability improvements during fine-tuning?
  • What is the relationship between grokking and double descent?