Summary
Power et al. (2022) discover and name “grokking”: a phenomenon in which neural networks first memorize their training data (reaching near-zero training loss) and then, much later in training, suddenly achieve near-perfect generalization on held-out data. By studying small, algorithmically generated datasets (modular arithmetic, permutation groups), the paper opens a new lens on how neural networks transition from memorization to generalization.
Key Claims
- Neural networks can generalize long after apparent overfitting, in a phase transition called grokking.
- The delay between memorization and generalization can span orders of magnitude in training steps.
- Smaller datasets require more optimization steps for generalization to emerge.
- Weight decay strongly promotes generalization; without it, grokking may not occur.
- Grokking provides a tractable laboratory for studying generalization in overparameterized networks.
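The delay described in the claims above can be quantified directly from accuracy curves: it is the gap in steps between when training accuracy saturates and when validation accuracy does. A minimal sketch of such a measurement (the helper name, threshold, and curves are illustrative, not from the paper):

```python
def grokking_delay(train_acc, val_acc, threshold=0.99):
    """Return the number of optimization steps between the point where
    training accuracy first reaches `threshold` (memorization) and the
    point where validation accuracy does (generalization).
    A large positive gap is the signature of grokking."""
    def first_hit(curve):
        for step, acc in enumerate(curve):
            if acc >= threshold:
                return step
        return None  # never reaches threshold (e.g. runs without weight decay)

    memorize = first_hit(train_acc)
    generalize = first_hit(val_acc)
    if memorize is None or generalize is None:
        return None
    return generalize - memorize

# Synthetic illustration: training accuracy saturates at step 3,
# validation accuracy only jumps at step 8, so the delay is 5 steps.
train = [0.2, 0.6, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
val   = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.5, 1.0, 1.0]
print(grokking_delay(train, val))  # → 5
```

In the paper's runs this gap can span orders of magnitude more steps than memorization itself took, which is why the curves must be tracked far longer than usual.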
Methods
Experiments run on algorithmically generated datasets: modular arithmetic (a+b mod p, a*b mod p, etc.) and permutation-group composition. A small Transformer is trained with the Adam optimizer, and training and validation loss curves are tracked for far longer than is typical. Ablations cover dataset size, weight decay, and optimizer choice. The key observation is that validation accuracy follows a step-function increase long after training accuracy reaches 100%.
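The datasets are small enough to enumerate exhaustively: for a binary operation over a prime p there are only p² equations, which are shuffled and split into train/validation sets. A sketch of how such a dataset might be built for a + b mod p (the split fraction and seed here are illustrative, not the paper's exact values):

```python
import random

def modular_addition_dataset(p, train_frac=0.5, seed=0):
    """Enumerate all p*p equations (a, b, (a+b) % p), shuffle them with a
    fixed seed, and split into train/validation sets by train_frac.
    Lowering train_frac corresponds to the 'smaller dataset' ablation,
    which delays the onset of grokking."""
    equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(equations)
    split = int(train_frac * len(equations))
    return equations[:split], equations[split:]

train, val = modular_addition_dataset(p=97, train_frac=0.5)
print(len(train), len(val))  # → 4704 4705
```

Because the full operation table is finite, validation accuracy directly measures whether the network has inferred the underlying rule rather than memorized the seen entries.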
Connections
- grokking — the phenomenon introduced
- phase-transition — generalization emerges suddenly after an extended memorization phase
- scaling-laws — dataset size and training steps interact in predictable ways governing when grokking occurs
- emergent-abilities — grokking is a discrete capability jump that cannot be predicted from training loss alone
- openai — primary institution
Citation
Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. ICLR 2022 Workshop. https://arxiv.org/abs/2201.02177