Summary
Power et al. (2022) discover and name “grokking”: a phenomenon in which neural networks first memorize their training data (reaching near-zero training loss) and then, much later in training, suddenly achieve near-perfect generalization on held-out data. By studying small, algorithmically generated datasets (modular arithmetic, permutation groups), the paper opens a new lens on how neural networks transition from memorization to generalization.
Key Claims
- Neural networks can generalize long after apparent overfitting, in a phase transition called grokking.
- The delay between memorization and generalization can span orders of magnitude in training steps.
- Smaller datasets require more optimization steps for generalization to emerge.
- Weight decay strongly promotes generalization; without it, grokking may not occur.
- Grokking provides a tractable laboratory for studying generalization in overparameterized networks.
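The delay described in the claims above can be quantified directly from accuracy curves: it is the gap in steps between when training accuracy saturates and when validation accuracy does. A minimal sketch of such a measurement (the helper name, threshold, and curves are illustrative, not from the paper):

```python
def grokking_delay(train_acc, val_acc, threshold=0.99):
    """Return the number of optimization steps between the point where
    training accuracy first reaches `threshold` (memorization) and the
    point where validation accuracy does (generalization).
    A large positive gap is the signature of grokking."""
    def first_hit(curve):
        for step, acc in enumerate(curve):
            if acc >= threshold:
                return step
        return None  # never reaches threshold (e.g. runs without weight decay)

    memorize = first_hit(train_acc)
    generalize = first_hit(val_acc)
    if memorize is None or generalize is None:
        return None
    return generalize - memorize

# Synthetic illustration: training accuracy saturates at step 3,
# validation accuracy only jumps at step 8, so the delay is 5 steps.
train = [0.2, 0.6, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val   = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.5, 1.0, 1.0]
print(grokking_delay(train, val))  # → 5
```

In the paper's runs this gap can span orders of magnitude more steps than memorization itself took, which is why the curves must be tracked far longer than usual.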
Methods
Experiments run on algorithmically generated datasets: modular arithmetic (a+b mod p, a*b mod p, etc.) and permutation-group composition. A small Transformer is trained with the Adam optimizer, and training and validation loss curves are tracked for far longer than is typical. Ablations cover dataset size, weight decay, and optimizer choice. The key observation is that validation accuracy follows a step-function increase long after training accuracy reaches 100%.
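The datasets are small enough to enumerate exhaustively: for a binary operation over a prime p there are only p² equations, which are shuffled and split into train/validation sets. A sketch of how such a dataset might be built for a + b mod p (the split fraction and seed here are illustrative, not the paper's exact values):

```python
import random

def modular_addition_dataset(p, train_frac=0.5, seed=0):
    """Enumerate all p*p equations (a, b, (a+b) % p), shuffle them with a
    fixed seed, and split into train/validation sets by train_frac.
    Lowering train_frac corresponds to the 'smaller dataset' ablation,
    which delays the onset of grokking."""
    equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(equations)
    split = int(train_frac * len(equations))
    return equations[:split], equations[split:]

train, val = modular_addition_dataset(p=97, train_frac=0.5)
print(len(train), len(val))  # → 4704 4705
```

Because the full operation table is finite, validation accuracy directly measures whether the network has inferred the underlying rule rather than memorized the seen entries.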
Connections
- grokking — the phenomenon introduced
- phase-transition — generalization emerges suddenly after an extended memorization phase
- scaling-laws — dataset size and training steps interact in predictable ways governing when grokking occurs
- emergent-abilities — grokking is a discrete capability jump that cannot be predicted from training loss alone
- openai — primary institution
Citation
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. ICLR 2022 Workshop. https://arxiv.org/abs/2201.02177