What It Is

Vanishing gradients occur when gradients shrink exponentially as they propagate backward through many layers during training, causing early layers to update so slowly they barely learn.

Why It Matters

Without a gradient highway through the network, layers near the output train well but early layers receive a near-zero learning signal. This was a fundamental barrier to training deep neural networks before architectural solutions (residual connections, careful initialization, normalization) addressed it.

How It Works

Each layer multiplies the incoming gradient by its local Jacobian. If those Jacobians have norms below 1, repeated multiplication across 50+ layers drives the gradient toward zero: a per-layer factor of 0.9 compounds to roughly 0.9^50 ≈ 0.005, and smaller factors leave early layers with update signals many orders of magnitude weaker than later layers. Residual connections address this by adding an identity skip path: a block computes x + F(x), so its Jacobian is I + ∂F/∂x, and the identity term lets gradient flow backward unattenuated regardless of what the learned layers do.
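
The effect is easy to observe directly. The following is a minimal sketch (assuming PyTorch is available; the depth, width, tanh nonlinearity, and default initialization are illustrative choices, not prescribed by the text): it builds the same 50-block stack with and without a skip connection and compares the gradient norm reaching the first layer's weights.

```python
# Minimal sketch: gradient reaching the first layer of a deep plain stack
# vs. the same stack with residual (identity skip) connections.
import torch
import torch.nn as nn

DEPTH, WIDTH = 50, 64  # illustrative values

class Block(nn.Module):
    def __init__(self, residual: bool):
        super().__init__()
        self.linear = nn.Linear(WIDTH, WIDTH)
        self.residual = residual

    def forward(self, x):
        out = torch.tanh(self.linear(x))
        # Residual block computes x + F(x); plain block computes F(x) only.
        return x + out if self.residual else out

def first_layer_grad_norm(residual: bool) -> float:
    torch.manual_seed(0)
    blocks = nn.Sequential(*[Block(residual) for _ in range(DEPTH)])
    x = torch.randn(8, WIDTH)
    blocks(x).sum().backward()
    return blocks[0].linear.weight.grad.norm().item()

print("plain    first-layer grad norm:", first_layer_grad_norm(residual=False))
print("residual first-layer grad norm:", first_layer_grad_norm(residual=True))
# The plain stack's first-layer gradient is typically orders of magnitude
# smaller; the identity path keeps the residual stack's gradient at a usable scale.
```

Exact numbers depend on the initialization and nonlinearity, but the qualitative gap between the two runs is the vanishing-gradient effect described above.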

Key Sources