Correcting the initialization bias in exponential moving averages that are started from zero.

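Exponential moving averages (EMAs) initialized to zero are biased toward zero in early steps. With the update m_t = β · m_{t-1} + (1 - β) · x_t and m_0 = 0, an EMA with β = 0.9 equals 0.1 · x_1 after step 1: not x_1, not a real average, just 10% of the first value. More generally, if the inputs have a constant mean, the expected value of m_t is (1 - β^t) times that mean, so the EMA systematically underestimates its magnitude for as long as β^t is non-negligible.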

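Adam corrects for this by dividing each EMA by (1 - β^t): m̂_t = m_t / (1 - β₁^t) for the first moment, and likewise v̂_t = v_t / (1 - β₂^t) for the second. At t = 1 with β₁ = 0.9, the correction factor is 1/(1 - β₁) = 10, exactly canceling the 0.1 factor. As t → ∞, β^t → 0 and the correction disappears. The fix is exact in expectation when the inputs have a constant mean; for a non-stationary signal it is only approximate.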
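The correction can be checked numerically. The sketch below is a minimal illustration, not taken from any particular library: the function name `ema_with_bias_correction` and the constant test input are assumptions chosen for the example. It tracks a zero-initialized EMA alongside its bias-corrected version.

```python
def ema_with_bias_correction(values, beta=0.9):
    """Return (raw, corrected) EMA sequences for an EMA started at zero.

    Hypothetical helper for illustration only.
    """
    m = 0.0
    raw, corrected = [], []
    for t, x in enumerate(values, start=1):
        # Standard EMA update with m_0 = 0.
        m = beta * m + (1.0 - beta) * x
        raw.append(m)
        # Divide by (1 - beta^t) to undo the shrinkage toward zero.
        corrected.append(m / (1.0 - beta ** t))
    return raw, corrected


# Constant input of 5.0: the raw EMA starts at 0.5 and climbs toward 5.0,
# while the corrected EMA sits at 5.0 (up to floating-point error) from step 1.
raw, corrected = ema_with_bias_correction([5.0] * 50)
```

For a constant input the corrected value matches the input at every step, which is the sense in which the correction is exact. With noisy inputs it removes the bias in expectation but not the noise.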

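This correction matters most early in training, and how early depends on β: the factor 1 - β^t is within 1% of 1 after about 44 steps for β₁ = 0.9, but only after about 4,600 steps for β₂ = 0.999. Without it, both the first-moment (momentum) estimate and the second-moment estimate are pulled toward zero. Because Adam's update is proportional to m_t / √v_t, and the default β₂ shrinks the second moment far more than β₁ shrinks the first, uncorrected early updates come out too large rather than too small: at t = 1 the uncorrected ratio is 0.1 / √0.001 ≈ 3.2 times the corrected one.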
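To see how the two biases combine in an Adam-style update, the following sketch compares the per-step update magnitude (divided by the learning rate) with and without bias correction. It assumes a constant scalar gradient and the commonly used defaults β₁ = 0.9, β₂ = 0.999, ε = 1e-8. The function name `adam_update_magnitudes` is hypothetical, and this is an illustration rather than a full optimizer.

```python
import math


def adam_update_magnitudes(grad, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (uncorrected, corrected) lists of |update| / lr per step.

    Hypothetical helper: assumes the same scalar gradient at every step.
    """
    m = 0.0  # first-moment EMA, started at zero
    v = 0.0  # second-moment EMA, started at zero
    uncorrected, corrected = [], []
    for t in range(1, steps + 1):
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad * grad
        # Update computed from the raw, zero-biased moments.
        uncorrected.append(abs(m) / (math.sqrt(v) + eps))
        # Update computed from the bias-corrected moments.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        corrected.append(abs(m_hat) / (math.sqrt(v_hat) + eps))
    return uncorrected, corrected


uncorrected, corrected = adam_update_magnitudes(grad=1.0, steps=100)
```

Under these assumptions the corrected magnitude stays at about 1 from the first step, while the uncorrected magnitude at step 1 is 0.1 / √0.001, roughly 3.16, because the second moment is shrunk more strongly than the first.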

Key Sources