Your neural network just hit 100% training accuracy at step 1,000. Validation is still at 10% — random chance. You keep training. Nothing. 10,000 steps. Still 10%. 20,000 steps. 10%. And then, at step 33,000, without warning, it jumps to 99% on data it’s never seen. What just happened? This paper ran the controlled experiments to find out.

The analogy

Think of memorization vs. understanding like two ways to pass a math exam.

The first way: memorize every worked example in the textbook. You can reproduce any problem you’ve seen exactly. But give you a new one and you’re lost — you’re pattern-matching to memory, not reasoning.

The second way: actually understand the underlying rule. Slower to achieve, but once you have it, any problem — seen or unseen — yields.

Your neural network starts with memorization — it’s the easy path. But memorization is expensive: you need to store every example explicitly, which requires large weights. Understanding is cheaper: the rule for modular addition compresses into a Fourier structure that takes much less “space” in weight terms.

This is where weight decay comes in. Weight decay is a continuous pressure that penalizes large weights. While training loss is high, its pull is overwhelmed by the loss gradient — the network needs those large weights to push training accuracy up. But once training is perfect, the loss gradient nearly vanishes, and weight decay starts quietly shrinking everything. After enough steps, it shrinks the memorizing solution down far enough that the network “falls through” into the generalizing solution, which was always lower-norm.

That’s grokking.
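A minimal sketch of this dynamic — not the paper’s code, and the `lr`/`wd` values are just illustrative. One SGD step with decoupled weight decay: while the task gradient is large it dominates the update; once training loss is near zero, the update collapses to pure multiplicative shrinkage of the weights.

```python
import numpy as np

def sgd_weight_decay_step(w, grad, lr=0.03, wd=1e-3):
    """One SGD step with decoupled weight decay.

    While the task gradient is large (memorization phase), it dominates.
    Once training loss is ~0, grad ~ 0 and the update reduces to
    w <- (1 - lr*wd) * w: geometric shrinkage toward the low-norm
    (generalizing) solution.
    """
    return w - lr * grad - lr * wd * w

# After memorization, the task gradient is ~0, so the norm decays
# geometrically — each step multiplies the weights by (1 - lr*wd):
w = np.ones(4)
for _ in range(100):
    w = sgd_weight_decay_step(w, grad=np.zeros(4))
```

That geometric shrinkage is the “silent” phase of grokking: nothing moves in the accuracy curves, but the norm is steadily falling.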

The mechanism

The task is beautifully simple: learn modular addition.

Input:  a=47, b=62
Output: (47 + 62) mod 97 = 12

There are 9,409 possible pairs. The network sees 20% of them (1,882 pairs). Can it learn the rule without seeing the other 7,527?
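The setup is small enough to write out in full. A sketch of the dataset construction under the split described above (the paper’s exact seeding and split code may differ):

```python
import itertools
import random

p = 97
pairs = list(itertools.product(range(p), repeat=2))  # all 97^2 = 9,409 (a, b) pairs

random.seed(0)  # assumed seed, for reproducibility of the sketch
random.shuffle(pairs)

n_train = round(0.2 * len(pairs))  # 20% -> 1,882 training pairs
train = [(a, b, (a + b) % p) for a, b in pairs[:n_train]]
test  = [(a, b, (a + b) % p) for a, b in pairs[n_train:]]  # the held-out 7,527
```

The labels are fully determined by the rule, so a network that internalizes `(a + b) mod p` gets every held-out pair for free.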

Here’s what grokking looks like in practice:

Steps:    0      1k     5k     10k    20k    33k    50k    400k
          |      |      |      |      |      |      |      |
Train %:  10%    100%   100%   100%   100%   100%   100%   100%
Test  %:  10%    10%    10%    10%    10%    99%    99%    99%
                                              ↑
                                         GROKKING!
                               T_train=1k         T_grok=33k
                               ΔT = 32,000 steps of waiting

The paper defines grokking delay ΔT = T_grok - T_train — the steps spent in that limbo between “training is perfect” and “generalization appears.”

The question is: what makes ΔT long or short? Which knobs are actually causal?

ASCII diagram

Weight space — two attractors

      Memorizing               Generalizing
      attractor                attractor
         ___                      ___
        /   \     weight decay   /   \
  -----/ MEM \~~~~~~~~~~~~~~~~~~/ GEN \-----
        \___/   →→→→→→→→→→→→→   \___/
         ↑
    Large weights               Small weights
    (specific, brittle)         (compact, general)
    High L2 norm                Low L2 norm

Weight decay = constant pressure toward the low-norm well.
Grokking = the moment the ball rolls over the hill.

The weight norm data from this paper makes this concrete. Across all configurations that grokked, width-512 models hit a threshold of ||W||_RMS ≈ 0.022 ± 0.003 at the grokking step — regardless of architecture, optimizer, or activation function. The number of steps to get there varies wildly, but the destination is the same.
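For monitoring this yourself, here is one plausible reading of the RMS weight norm — root-mean-square over all weight entries. The paper’s exact definition may differ (e.g. per-layer averaging), so treat this as a sketch:

```python
import numpy as np

def rms_weight_norm(weight_matrices):
    """RMS norm over all weight entries: sqrt(mean(w_i^2)).

    One plausible reading of ||W||_RMS; track it every few hundred
    steps and watch for it approaching the grokking threshold.
    """
    flat = np.concatenate([w.ravel() for w in weight_matrices])
    return float(np.sqrt(np.mean(flat ** 2)))
```

Logged alongside train/test accuracy, this is the curve that keeps moving while the accuracy curves sit still.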

The math that matters

The grokking delay metric:

ΔT = T_grok - T_train

where T_train = first step with ≥99% training accuracy, T_grok = first step with ≥99% test accuracy.

If a configuration never reaches 99% test accuracy within the training budget, it’s labeled DNF (did not grok).

This metric is doing important work: it separates two failure modes with different causes — (1) the network memorizes but never generalizes (DNF_test, too little regularization), and (2) the network can’t even memorize (DNF_train, too much regularization).
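The two definitions together make the classification mechanical. A sketch (helper name is mine, not the paper’s), operating on per-step accuracy curves:

```python
def classify_run(train_acc, test_acc, threshold=0.99):
    """Classify a run from accuracy curves (lists indexed by step).

    Returns (label, delta_t):
      - ('DNF_train', None) if training accuracy never reaches the threshold,
      - ('DNF_test',  None) if it memorizes but never generalizes,
      - ('grokked', T_grok - T_train) otherwise.
    """
    t_train = next((t for t, a in enumerate(train_acc) if a >= threshold), None)
    if t_train is None:
        return "DNF_train", None
    t_grok = next((t for t, a in enumerate(test_acc) if a >= threshold), None)
    if t_grok is None:
        return "DNF_test", None
    return "grokked", t_grok - t_train
```

For the curve at the top of this post, `t_train` lands near step 1,000 and `t_grok` near 33,000, giving ΔT ≈ 32,000.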

Walkthrough with actual numbers

Here are the real experimental results from the paper.

Experiment: weight decay sweep on a depth-4 GELU MLP (width 512, SGD, lr=0.03)

λ (weight decay)  │ Grokked / 3  │ Mean ΔT
──────────────────┼──────────────┼─────────────────
1×10⁻⁵            │ 0/3          │ DNF_test (memorizes, no grok)
5×10⁻⁵            │ 1/3          │ 388,000 steps
1×10⁻⁴            │ 3/3          │ 220,000 ± 36,056
5×10⁻⁴            │ 3/3          │ 45,333 ± 7,572
1×10⁻³            │ 3/3          │ 25,333 ± 5,033  ← SWEET SPOT
2×10⁻³            │ 0/3          │ DNF_train (can't even memorize)
5×10⁻³            │ 0/3          │ DNF_train

One factor-of-two step beyond the optimal (λ=1e-3 → λ=2e-3) causes complete training failure. The Goldilocks zone is that narrow.
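A back-of-envelope model — mine, not the paper’s — makes the λ dependence in the lower half of that table intuitive. If the grokking step is roughly when pure weight decay has shrunk the norm from some initial value `n0` down to the threshold `n_star` (both values here are assumed, not measured), then the continuous-time decay `n(t) = n0 · exp(-lr·λ·t)` predicts a waiting time inversely proportional to λ — roughly the trend from λ=1e-4 (220k steps) to λ=1e-3 (25k steps):

```python
import math

def predicted_grok_step(n0, n_star, lr, lam):
    """Toy model: under pure weight decay, dn/dt = -lr*lam*n,
    so n(t) = n0 * exp(-lr*lam*t). Solve n(t*) = n_star for t*.
    """
    return math.log(n0 / n_star) / (lr * lam)

# Halving lambda doubles the predicted wait (n0=0.1 is an assumed
# initial norm; 0.022 is the paper's reported threshold):
t_a = predicted_grok_step(0.1, 0.022, lr=0.03, lam=1e-3)
t_b = predicted_grok_step(0.1, 0.022, lr=0.03, lam=5e-4)
```

It says nothing about the DNF_train cliff at λ=2e-3 — that failure is about too much regularization destroying memorization, not about decay speed.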

Depth experiment results:

Depth  Architecture         Seeds grokked  Mean ΔT
─────  ───────────────────  ─────────────  ──────────────────
2      Flat MLP, width 256  4/5            72,000 ± 85,536
4      Flat MLP, width 256  0/5            DNF (all seeds)
8      Residual+LN, w=512   3/5            33,333 ± 12,858

Depth 4 with no residual connections: zero of five. Depth 8 with residual connections and LayerNorm: three of five succeed. The message: it’s not depth that helps — it’s depth with stabilization.

ReLU vs GELU with actual step counts:

At the Sweep B config (λ=5×10⁻⁴, width 512):

  • GELU: 5/5 seeds, mean ΔT = 45,600 steps
  • ReLU: 5/5 seeds, mean ΔT = 196,800 steps
  • That’s a 4.32× gap from activation choice alone

But switch to Sweep A config (λ=2×10⁻³, width 256):

  • GELU: 0/5 seeds — can’t even grok
  • ReLU: 2/5 seeds, mean ΔT = 266,000 steps

Same activation, different regime, completely different story.

What’s clever — find the instinct

The paper’s deepest finding is the weight norm experiment (Section 5.6).

They trained six different configurations — different architectures, activations, optimizers — and measured the RMS weight norm at the exact grokking step. The result:

All width-512 models grokked at ||W||_RMS ≈ 0.022, regardless of architecture or activation.

ReLU needed 180,000 steps to get there. GELU needed 26,000. But both crossed the same threshold.

This cleanly separates two questions that were previously tangled together:

  1. What norm threshold triggers generalization? — Universal, ~0.022, activation-independent
  2. How fast does weight decay drive the norm to that threshold? — Activation-dependent

The instinct here: if you had been watching weight norms instead of validation accuracy, you’d have seen this coming. The grokking “event” isn’t a sudden change in the network’s understanding — it’s a gradual compression that suddenly crosses a threshold. The “aha moment” has been building for 30,000 steps. It just hasn’t shown up in validation yet.

Real quotes from the paper

“Grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization.”

Translation: Don’t blame (or credit) Transformers vs. MLPs for grokking behavior. It’s the regularization regime and optimizer stability doing the work. Architecture is almost irrelevant once you control for those.

“Depth has a non-monotonic effect, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization.”

Translation: Adding layers can actually hurt if you don’t add the scaffolding (residual connections, LayerNorm) that keeps optimization tractable. Depth without stability = a deeper local minimum you can’t escape.

“The activation function controls the rate at which weight decay drives the norm to that threshold, not the threshold itself.”

Translation: GELU isn’t smarter than ReLU about what to learn. It’s faster at compressing what it knows — it reaches the critical weight norm threshold faster. The destination is the same; the journey time differs.

“Weight decay is the dominant control parameter, exhibiting a narrow ‘Goldilocks’ regime in which grokking occurs, while too little or too much prevents generalization.”

Translation: There is one knob that matters above all others. Too little weight decay and the network memorizes forever. Too much and it can’t even memorize. The sweet spot is narrow — sometimes a 2× step above the optimum completely kills training.

Does it actually work?

The paper’s main results table on architecture comparison:

Architecture      │ Config          │ Seeds grokked │ Mean ΔT          │ Ratio
──────────────────┼─────────────────┼───────────────┼──────────────────┼──────
MLP (GELU, d=4)   │ optimal λ=1e-3  │ 5/5           │ 26,800 ± 6,419   │ 1.0×
Transformer (1L)  │ optimal λ=5.0   │ 5/5           │ 50,800 ± 38,745  │ 1.90×
MLP (GELU, d=4)   │ matched λ=1e-4  │ 5/5           │ 45,600 ± 5,550   │ —
Transformer (1L)  │ matched λ=1.0   │ 5/5           │ 50,800 ± 22,565  │ 1.11×

The earlier claim that Transformers grok 2.18× faster than MLPs? Mostly an artifact of giving Transformers better hyperparameters. Under matched configs: 1.11×. At each architecture’s own optimum: 1.90×. Both much smaller than the published gap.

What doesn’t work:

  • Everything is on modular addition mod 97. Real-world tasks may have fundamentally different grokking dynamics or no grokking at all.
  • The paper can identify the Goldilocks zone empirically but can’t predict where it is from first principles. You still have to sweep.
  • Even at optimal settings, some seeds fail (depth 8: 3 of 5). The stochasticity isn’t explained.
  • Only SGD and AdamW tested. Other optimizers (Lion, Muon) might change the picture.
  • Small models only. The 1-layer Transformer here is orders of magnitude smaller than anything you’d train at scale.

So what?

If you’re training models and watching validation loss plateau for a long time — don’t kill the run prematurely. What looks like a stall might be the slow weight-norm compression that precedes generalization. The tell: watch the RMS weight norm. If it’s still declining, something is still happening.

More practically: if you’re doing architecture comparisons, make sure you’re comparing at matched regularization regimes, not just matched architectures. The paper’s retroactive correction — finding that the 2.18× Transformer gap was really 1.11× or 1.90× depending on how you calibrate — is a warning about how easy it is to attribute optimizer confounds to architecture.

Weight decay is the real boss. Grokking is just gradient descent finding the compact solution it was always being pushed toward — it just needed enough time and the right regularization to get there.

Connections

  • grokking — builds on and extends the original grokking paper
  • transformer — one of the architectures studied
  • lora — weight compression connects to low-rank adaptation

Citation

arXiv:2603.25009

Manir, S. B., & Paul Rupa, A. (2026). A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization. https://arxiv.org/abs/2603.25009