Momentum accumulates a velocity vector in directions of persistent gradient, damping oscillations and accelerating convergence.

Without momentum, SGD takes noisy, erratic steps — every batch yields a new gradient estimate that can point in a slightly different direction. Momentum smooths this by keeping an exponentially weighted moving average of past gradients. If the gradient has pointed consistently in one direction, momentum builds up speed in that direction; if the gradient oscillates, successive terms partially cancel and the oscillations are damped.
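A minimal sketch of this behavior (function and parameter names are illustrative, not from any particular library):

```python
def momentum_velocity(grads, beta=0.9):
    """Accumulate a velocity as an exponentially decayed sum of gradients.

    Consistent gradients compound the velocity; alternating gradients
    largely cancel, which is the damping effect described above.
    """
    v = 0.0
    history = []
    for g in grads:
        v = beta * v + g  # classic (non-averaged) momentum accumulator
        history.append(v)
    return history

# Consistent gradients: velocity builds up toward g / (1 - beta) = 10.
consistent = momentum_velocity([1.0] * 20)

# Oscillating gradients: velocity stays small because terms cancel.
oscillating = momentum_velocity([1.0, -1.0] * 10)
```

Running this, `consistent[-1]` approaches 10 while every entry of `oscillating` stays below 1 in magnitude, matching the intuition above.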

In Adam, the first moment estimate m_t is the momentum term: m_t = β₁ · m_{t-1} + (1-β₁) · g_t. With β₁ = 0.9, the current gradient gets 10% weight and the accumulated history gets 90%. The update uses the bias-corrected m̂_t = m_t / (1 - β₁ᵗ) rather than the raw gradient; the correction compensates for the zero initialization m_0 = 0, which otherwise biases m_t toward zero during the first steps.
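The two formulas above can be checked directly. This sketch (hypothetical helper, not the Adam reference implementation, which also tracks a second moment) computes m_t and m̂_t for a stream of gradients:

```python
def adam_first_moment(grads, beta1=0.9):
    """Return the bias-corrected first-moment estimates m̂_t for each step.

    m_t   = beta1 * m_{t-1} + (1 - beta1) * g_t   (EMA, m_0 = 0)
    m̂_t  = m_t / (1 - beta1 ** t)                 (bias correction)
    """
    m = 0.0
    corrected = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        corrected.append(m / (1 - beta1 ** t))
    return corrected

# For a constant gradient of 1.0, the raw m_1 is only 0.1 (biased toward
# the zero init), but the corrected m̂_1 already equals 1.0.
m_hat = adam_first_moment([1.0, 1.0, 1.0])
```

Every entry of `m_hat` equals 1.0 here, showing that bias correction recovers the true gradient mean from the very first step instead of warming up slowly from zero.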

Key Sources