What It Is
Batch normalization normalizes each layer’s activations to have zero mean and unit variance across the current training batch, then applies learned scale and shift parameters. It is typically placed after convolutional or fully connected layers and before the activation function.
Why It Matters
Without normalization, activation distributions shift as earlier layers update (internal covariate shift), forcing each layer to constantly adapt to a moving target. Batch normalization stabilizes these distributions, allowing higher learning rates, reducing sensitivity to initialization, and acting as a regularizer.
How It Works
For a mini-batch of activations, compute the batch mean μ and variance σ². Normalize: x̂ = (x − μ) / √(σ² + ε). Then apply learned parameters: y = γx̂ + β. At inference, use running statistics accumulated during training rather than batch statistics. ResNet uses batch normalization after each convolution, which was critical for training 50–152 layer networks stably.
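The two modes described above — batch statistics during training, running statistics at inference — can be sketched in NumPy as follows. The function name, argument layout, and momentum convention (new statistics weighted by `momentum`, as in PyTorch) are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch normalization over the batch axis (axis 0).

    x: (N, D) activations; gamma, beta: (D,) learned scale and shift.
    running_mean / running_var are updated in place during training
    and used directly at inference.
    """
    if training:
        mu = x.mean(axis=0)          # batch mean μ
        var = x.var(axis=0)          # batch variance σ²
        # Accumulate an exponential moving average for inference.
        running_mean *= 1 - momentum
        running_mean += momentum * mu
        running_var *= 1 - momentum
        running_var += momentum * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # x̂ = (x − μ) / √(σ² + ε)
    return gamma * x_hat + beta             # y = γx̂ + β
```

With γ = 1 and β = 0, the output of a training-mode call has approximately zero mean and unit variance per feature; for convolutional activations of shape (N, C, H, W), the same statistics would instead be computed per channel over the (N, H, W) axes.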