Gradient descent using a random mini-batch of training examples per step, rather than the full dataset.

Instead of computing the exact gradient over all training data (expensive), SGD estimates the gradient from a small random subset (a mini-batch). The estimate is noisy but cheap. Over many steps, the noise averages out and the optimizer converges. The stochasticity also acts as implicit regularization — the noise keeps the optimizer from settling into sharp local minima.
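A minimal sketch of the mini-batch loop, assuming a synthetic noiseless linear-regression problem (the data, batch size, and learning rate here are illustrative choices, not prescribed values):

```python
import numpy as np

# Hypothetical data: 1000 examples, 5 features, noiseless targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true

w = np.zeros(5)
lr = 0.1
batch_size = 32
for step in range(500):
    # Sample a random mini-batch instead of touching all 1000 examples.
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    # Noisy but cheap estimate of the mean-squared-error gradient.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
```

Each step costs O(batch_size) instead of O(dataset size); the gradient estimate is unbiased, so the noise washes out across steps.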

SGD with a single global learning rate applies the same step size to every parameter, which becomes a problem when gradient magnitudes vary dramatically across parameters. Adaptive methods like Adam address this by tracking per-parameter gradient statistics and scaling each parameter's step accordingly.
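A sketch of the Adam-style per-parameter scaling (the function name and the toy gradient values are illustrative; the update rule follows the standard Adam formulation with its usual default hyperparameters):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # running mean of gradients
    v = b2 * v + (1 - b2) * grad**2    # running mean of squared gradients
    m_hat = m / (1 - b1**t)            # bias correction for startup steps
    v_hat = v / (1 - b2**t)
    # Each parameter gets its own effective step size: lr / sqrt(v_hat).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
grad = np.array([1e-4, 1e2])   # gradient magnitudes differ by six orders
w, m, v = adam_step(w, grad, m, v, t=1)
```

After one step both parameters move by roughly `lr`, despite the huge gap in raw gradient magnitude — the per-parameter second-moment estimate normalizes each coordinate's step.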

Key Sources