What It Is
Batch normalization normalizes each layer’s activations to have zero mean and unit variance across the current training batch, then applies learned scale and shift parameters. It is typically placed after convolutional or fully connected layers and before the activation function.
Why It Matters
Without normalization, activation distributions shift as earlier layers update (internal covariate shift), forcing each layer to constantly adapt to a moving target. Batch normalization stabilizes these distributions, allowing higher learning rates, reducing sensitivity to initialization, and acting as a regularizer.
How It Works
For a mini-batch of activations, compute the batch mean μ and variance σ². Normalize: x̂ = (x − μ) / √(σ² + ε). Then apply learned parameters: y = γx̂ + β. At inference, use running statistics accumulated during training rather than batch statistics. ResNet uses batch normalization after each convolution, which was critical for training 50–152 layer networks stably.
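The two modes described above — batch statistics during training, running statistics at inference — can be sketched in NumPy as follows. The function name, argument layout, and momentum convention (new statistics weighted by `momentum`, as in PyTorch) are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch normalization over the batch axis (axis 0).

    x: (N, D) activations; gamma, beta: (D,) learned scale and shift.
    running_mean / running_var are updated in place during training
    and used directly at inference.
    """
    if training:
        mu = x.mean(axis=0)          # batch mean μ
        var = x.var(axis=0)          # batch variance σ²
        # Accumulate an exponential moving average for inference.
        running_mean *= 1 - momentum
        running_mean += momentum * mu
        running_var *= 1 - momentum
        running_var += momentum * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # x̂ = (x − μ) / √(σ² + ε)
    return gamma * x_hat + beta             # y = γx̂ + β
```

With γ = 1 and β = 0, the output of a training-mode call has approximately zero mean and unit variance per feature; for convolutional activations of shape (N, C, H, W), the same statistics would instead be computed per channel over the (N, H, W) axes.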