Gradient descent using a random mini-batch of training examples per step, rather than the full dataset.

Instead of computing the exact gradient over all training data (expensive), SGD estimates the gradient from a small random subset (a mini-batch). The estimate is noisy but cheap. Over many steps, the noise averages out and the optimizer converges. The stochasticity also acts as implicit regularization — the noise keeps the optimizer from settling into sharp local minima.
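A minimal sketch of the mini-batch loop, assuming a synthetic noiseless linear-regression problem (the data, batch size, and learning rate here are illustrative choices, not prescribed values):

```python
import numpy as np

# Hypothetical data: 1000 examples, 5 features, noiseless targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true

w = np.zeros(5)
lr = 0.1
batch_size = 32
for step in range(500):
    # Sample a random mini-batch instead of touching all 1000 examples.
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    # Noisy but cheap estimate of the mean-squared-error gradient.
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
```

Each step costs O(batch_size) instead of O(dataset size); the gradient estimate is unbiased, so the noise washes out across steps.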

SGD with a single global learning rate applies the same step size to every parameter, which becomes a problem when gradient magnitudes vary dramatically across parameters. Adaptive methods like Adam address this by tracking per-parameter gradient statistics and scaling each parameter's step accordingly.
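A sketch of the Adam-style per-parameter scaling (the function name and the toy gradient values are illustrative; the update rule follows the standard Adam formulation with its usual default hyperparameters):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # running mean of gradients
    v = b2 * v + (1 - b2) * grad**2    # running mean of squared gradients
    m_hat = m / (1 - b1**t)            # bias correction for startup steps
    v_hat = v / (1 - b2**t)
    # Each parameter gets its own effective step size: lr / sqrt(v_hat).
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
grad = np.array([1e-4, 1e2])   # gradient magnitudes differ by six orders
w, m, v = adam_step(w, grad, m, v, t=1)
```

After one step both parameters move by roughly `lr`, despite the huge gap in raw gradient magnitude — the per-parameter second-moment estimate normalizes each coordinate's step.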

Key Sources