Batch normalization (Ioffe & Szegedy, 2015) was originally justified as reducing “internal covariate shift,” the change in the distribution of each layer’s inputs as the layers before it update. Santurkar et al. at MIT (2018) tested this directly and found that BN often doesn’t reduce internal covariate shift; in some cases it increases it. The real reason it works: it smooths the loss landscape, making gradients more Lipschitz and therefore more predictable, which lets you train stably at higher learning rates. One of deep learning’s most widely used techniques, adopted for years on an incorrect theory.
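To make the operation itself concrete, here is a minimal NumPy sketch of the batch-norm forward pass; the names (`batch_norm`, `gamma`, `beta`, `eps`) are illustrative, not taken from either paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization for a (batch, features) activation matrix.

    Normalizes each feature to zero mean and unit variance across the
    batch, then applies a learned scale (gamma) and shift (beta).
    """
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize (eps avoids divide-by-zero)
    return gamma * x_hat + beta            # learned scale and shift

# Toy usage: a batch of 4 examples with 3 deliberately badly scaled features.
x = np.random.randn(4, 3) * 10 + 5
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```

Note the learned `gamma` and `beta`: as Ioffe & Szegedy themselves pointed out, they can in principle undo the normalization entirely, so the layer’s output distribution is never actually pinned down. That alone hints that controlling covariate shift was never the whole story.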