What It Is

Vanishing gradients occur when gradients shrink exponentially as they propagate backward through many layers during training, causing early layers to update so slowly they barely learn.

Why It Matters

Without a gradient highway through the network, layers near the output train well but early layers receive a near-zero learning signal. This was a fundamental barrier to training deep neural networks before architectural solutions (residual connections, careful initialization, normalization) addressed it.

How It Works

Each layer multiplies the incoming gradient by its local Jacobian. If those Jacobians have norms below 1, repeated multiplication across 50+ layers drives the gradient toward zero: a per-layer factor of 0.9 compounds to roughly 0.9^50 ≈ 0.005, and smaller factors leave early layers with update signals many orders of magnitude weaker than later layers. Residual connections address this by adding an identity skip path: a block computes x + F(x), so its Jacobian is I + ∂F/∂x, and the identity term lets gradient flow backward unattenuated regardless of what the learned layers do.
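
The effect is easy to observe directly. The following is a minimal sketch (assuming PyTorch is available; the depth, width, tanh nonlinearity, and default initialization are illustrative choices, not prescribed by the text): it builds the same 50-block stack with and without a skip connection and compares the gradient norm reaching the first layer's weights.

```python
# Minimal sketch: gradient reaching the first layer of a deep plain stack
# vs. the same stack with residual (identity skip) connections.
import torch
import torch.nn as nn

DEPTH, WIDTH = 50, 64  # illustrative values

class Block(nn.Module):
    def __init__(self, residual: bool):
        super().__init__()
        self.linear = nn.Linear(WIDTH, WIDTH)
        self.residual = residual

    def forward(self, x):
        out = torch.tanh(self.linear(x))
        # Residual block computes x + F(x); plain block computes F(x) only.
        return x + out if self.residual else out

def first_layer_grad_norm(residual: bool) -> float:
    torch.manual_seed(0)
    blocks = nn.Sequential(*[Block(residual) for _ in range(DEPTH)])
    x = torch.randn(8, WIDTH)
    blocks(x).sum().backward()
    return blocks[0].linear.weight.grad.norm().item()

print("plain    first-layer grad norm:", first_layer_grad_norm(residual=False))
print("residual first-layer grad norm:", first_layer_grad_norm(residual=True))
# The plain stack's first-layer gradient is typically orders of magnitude
# smaller; the identity path keeps the residual stack's gradient at a usable scale.
```

Exact numbers depend on the initialization and nonlinearity, but the qualitative gap between the two runs is the vanishing-gradient effect described above.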

Key Sources