press to watch the network learn
Memorizing
Generalizing
Grokked!
training step: 0
steps/sec: 0
train accuracy: 0.0%
test accuracy: 0.0%
what the network sees — before training
legend: correct, wrong, no prediction yet
training (white dot) test (no dot)
x-axis = a, y-axis = b. watch the test cells turn green.
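The grid above can be sketched in a few lines. This is a minimal sketch, not the page's own code: the name `make_split`, the 30% training fraction and the seed are assumptions made for illustration, since the page does not state how the split is drawn.

```python
import numpy as np

def make_split(p=97, train_frac=0.3, seed=0):
    """Enumerate every (a, b) pair mod p and split the pairs into train/test.

    train_frac and seed are illustrative assumptions, not values from the page.
    """
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)   # shape (p*p, 2)
    labels = (pairs[:, 0] + pairs[:, 1]) % p           # target: (a + b) mod p

    order = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train_idx, test_idx = order[:n_train], order[n_train:]

    # Boolean p x p mask: True where the cell would carry a white dot (training).
    is_train = np.zeros((p, p), dtype=bool)
    is_train[pairs[train_idx, 0], pairs[train_idx, 1]] = True
    return pairs, labels, train_idx, test_idx, is_train
```

`is_train[a, b]` then says whether a cell is a training pair; every other cell is held out and only turns green if the network generalizes.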
Accuracy Over Time
blue = seen data, green = unseen data. watch the green line.
Loss Over Time
lower is better. watch test loss plateau then crash — that's the transition.
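For reference, here is one way the two curves could be computed from the network's outputs. It is a sketch under the assumption that the loss is softmax cross-entropy over the p possible answers; the page does not say which loss it uses, and `accuracy_and_loss` is a hypothetical helper name.

```python
import numpy as np

def accuracy_and_loss(logits, labels):
    """logits: (n, p) scores per candidate answer; labels: (n,) true answers.

    Assumes softmax cross-entropy; the page does not name its loss.
    """
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    acc = (shifted.argmax(axis=1) == labels).mean()
    return float(acc), float(loss)
```

Calling it once on the training pairs and once on the held-out pairs at each logging step gives the blue and green points. A network that outputs the same score for every answer has a loss of ln(p), about 4.57 for p = 97.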
fourier spectrum — what frequencies the network uses
each bar = one frequency component (k = 1 to (p-1)/2, i.e. 48 bars for p = 97). after grokking, only a few bars should spike: those are the frequencies the network uses to solve modular addition.
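A spectrum like this can be obtained with a discrete Fourier transform along the input-value axis of the first-layer weights. The sketch below assumes the weights are available as an array `W` of shape (neurons, p) and that the bar height is power summed over neurons; both are assumptions, as is the helper name `fourier_power`.

```python
import numpy as np

def fourier_power(W):
    """W: (n_neurons, p) array, one row of first-layer weights per neuron.

    Returns (k, power): frequencies k = 1..(p-1)//2 and the power at each,
    summed over neurons. The summing convention is an assumption.
    """
    F = np.fft.rfft(W, axis=1)                # shape (n_neurons, p // 2 + 1)
    power = (np.abs(F) ** 2).sum(axis=0)      # total power per frequency
    power = power[1:]                         # drop k = 0 (the constant offset)
    k = np.arange(1, len(power) + 1)
    return k, power
```

A neuron whose weights trace a clean wave such as cos(2πkx/p) puts essentially all of its power into the single bar k, which is why a sparse spectrum suggests Fourier features.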
try it yourself (status: idle)
a + b mod 97
Network: not initialized
Try this before and after grokking. Early on, the network memorizes the training pairs but guesses essentially at random on pairs it has not seen. After grokking, it should get unseen pairs right as well.
what the neurons learn — first-layer weights on a clock
each circle = one hidden neuron. the p points around the clock show its weight for each input value. smooth waves = Fourier features (the network found the circular structure of mod p).
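To see why smooth waves are enough, here is a hand-built sketch (not the trained network) that scores each candidate answer c with a sum of cosines of 2πk(a + b - c)/p. Every cosine equals 1 exactly when c = (a + b) mod p, so that answer gets the highest score. The frequencies (3, 14, 35) are arbitrary illustrative choices, and `fourier_logits` and `predict` are hypothetical names.

```python
import numpy as np

def fourier_logits(a, b, p=97, freqs=(3, 14, 35)):
    """Score every candidate c in 0..p-1 using a few cosine waves.

    freqs are arbitrary illustrative frequencies, not ones read off the demo.
    """
    c = np.arange(p)
    angles = 2 * np.pi * np.outer(freqs, a + b - c) / p   # (n_freqs, p)
    return np.cos(angles).sum(axis=0)                     # (p,)

def predict(a, b, p=97, freqs=(3, 14, 35)):
    """Pick the candidate with the highest score."""
    return int(np.argmax(fourier_logits(a, b, p, freqs)))
```

Because p is prime, any nonzero frequency works on its own in exact arithmetic; using several widens the gap between the right answer and the runners-up.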
before (step 0)
now (step 0)
learned representations — PCA of weight vectors
each dot = one hidden neuron's weight vector, projected to 2D via PCA. if neurons learn Fourier features, neurons with similar frequencies should cluster. color = neuron index.
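The 2D projection can be sketched with a plain SVD. This assumes the weight vectors are the rows of an array `W` of shape (neurons, p) and that PCA is done on mean-centered rows; the page does not say how it computes PCA, and `pca_2d` is a hypothetical name.

```python
import numpy as np

def pca_2d(W):
    """Project each row of W (one neuron's weight vector) onto the top two
    principal components. Returns an (n_neurons, 2) array of coordinates."""
    centered = W - W.mean(axis=0, keepdims=True)   # center each feature column
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T
```

Each returned row is one dot in the scatter plot; coloring by row number gives the neuron-index coloring described above.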
before (step 0)
now (step 0)