Concepts: residual connections | vanishing gradients | batch normalization
Builds on: VGGNet (explainer coming) | Highway Networks (explainer coming)
Leads to: An Image is Worth 16x16 Words (ViT)

The degradation problem no one expected

The story everyone in deep learning told themselves in 2014 was simple: deeper networks are better networks. More layers, more capacity, more powerful features. The recipe was clear.

Then someone trained a 56-layer network on CIFAR-10. It was worse than the 20-layer version. Not just on the test set — on the training set. This is not overfitting. A 56-layer network should be at least as good as a 20-layer one. You could always set the extra 36 layers to identity (just pass the input through untouched) and match the shallower model exactly. But plain networks couldn’t learn that. They got worse with depth.

The paper calls this the degradation problem. It was unexpected, it was real, and it broke the simple story.

The residual trick

Let’s start with an analogy. A hotel renovation team doesn’t design the building from scratch. The architects hand them a punch list: “keep everything, here’s what needs to change.” Replace the lobby flooring. Add a wall between rooms 14 and 15. Remove the old fixtures. The workers learn the delta — the corrections — not the full building. That’s much easier than designing the whole thing anew.

That’s exactly what residual learning does.

In a plain network, each block tries to learn the full mapping from input to output. Call the desired mapping H(x). The layers are trying to approximate H(x) from scratch using weights and nonlinearities. For very deep networks, this becomes unreliable — the optimizer struggles.

Residual learning reframes the problem. Instead of learning H(x), the layers learn F(x) = H(x) − x, the correction. Then the block adds the original input back:

H(x) = F(x) + x

The term F(x) is the “residual” — what needs to change. The term x passes through unchanged via a shortcut connection. No extra parameters, no extra computation. Just an addition.

“Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x.”

Translation: don’t ask the layer to produce the answer. Ask it to produce the correction. The answer is input plus correction.

Here’s the architecture, visually:

Plain block:                Residual block:

x ──→ Conv → ReLU ──→       x ──────────────────────┐
      Conv → ReLU ──→            ↓                  │  (shortcut)
                                Conv → BN → ReLU    │
                                Conv → BN           │
                                   ↓ F(x)           │
                              (+) ←─────────────────┘
                               ↓
                             H(x) = F(x) + x → ReLU

The shortcut path costs nothing. Zero extra parameters. Zero extra operations beyond a single addition. The entire innovation is that plus sign at the bottom.
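The two block types can be sketched in a few lines of plain Python. A linear layer stands in for the paper's Conv → BN stack, and all names here are illustrative, not a reference implementation:

```python
# Sketch of both block types. A linear layer stands in for the
# paper's Conv -> BN stack; names are illustrative.

def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    # Matrix-vector product: one weight "layer".
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def plain_block(W1, W2, x):
    # Plain block: the two layers must produce the full mapping H(x).
    return relu(linear(W2, relu(linear(W1, x))))

def residual_block(W1, W2, x):
    # Residual block: the layers produce only the correction F(x);
    # the shortcut adds x back before the final ReLU.
    f = linear(W2, relu(linear(W1, x)))              # F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])   # H(x) = F(x) + x

W = [[0.1, 0.0], [0.0, 0.1]]   # small weights -> small correction
x = [1.0, 0.5]
print([round(v, 3) for v in residual_block(W, W, x)])  # [1.01, 0.505]
```

With small weights, the residual block's output stays close to its input; the plain block's output is whatever the weights happen to produce.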

Why this works: the gradient argument

“We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.”

Translation: learning to do nothing is hard for a stack of ReLUs and convolutions. Learning small corrections from a zero start is easy. The residual formulation gives the optimizer a natural “nothing to do” baseline — F(x) = 0 — that it can always fall back to.

There’s also a gradient flow argument. During backpropagation, the gradient of a residual block’s output with respect to its input is:

∂H/∂x = ∂F/∂x + I

That +I term is the identity. Even if the learned gradients ∂F/∂x collapse toward zero in deep layers (the classic vanishing gradient problem), the identity term keeps a clean gradient highway all the way back to early layers. The depth tax on gradients is gone.
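A scalar toy model makes the effect concrete (this is an illustration, not the paper's analysis): treat each block's local derivative dF/dx as a single number and chain blocks by multiplying local derivatives.

```python
# Scalar toy model of gradient flow through 50 stacked blocks.
depth = 50
dF_dx = 0.1  # each block's learned part contributes a small derivative

# Plain network: gradient = product of dF/dx terms -> vanishes.
plain_grad = 1.0
for _ in range(depth):
    plain_grad *= dF_dx

# Residual network: gradient = product of (1 + dF/dx) terms,
# because dH/dx = dF/dx + 1 (the identity term).
res_grad = 1.0
for _ in range(depth):
    res_grad *= (1.0 + dF_dx)

print(f"plain:    {plain_grad:.3e}")   # ~1e-50, effectively zero
print(f"residual: {res_grad:.3e}")     # ~1.2e2, still trainable
```

The plain chain is 0.1^50 and underflows toward zero; the residual chain is 1.1^50 and stays in a usable range.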

Numeric walkthrough

Let’s trace one residual block with a tiny example. Suppose we have a 2-dimensional input and a simplified linear residual function (ignoring convolution for clarity).

Input: x = [1.0, 0.5]

Residual layer weights (2×2): W = [[0.1, 0.05], [0.05, 0.1]], bias = [0, 0]

Step 1: compute the residual F(x) = Wx

F(x)[0] = 0.1 × 1.0  +  0.05 × 0.5  =  0.100 + 0.025  =  0.125
F(x)[1] = 0.05 × 1.0  +  0.1 × 0.5  =  0.050 + 0.050  =  0.100

Step 2: add the shortcut H(x) = F(x) + x

H(x)[0] = 0.125 + 1.0 = 1.125
H(x)[1] = 0.100 + 0.5 = 0.600

Now compare: if the true target is [1.1, 0.6], the residual block only needed to learn F(x) ≈ [0.1, 0.1] — a small correction. Without the shortcut, the layers would need to reconstruct [1.1, 0.6] entirely from [1.0, 0.5]. Same capacity. Very different optimization difficulty.
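The two steps above can be checked in a few lines (same numbers, plain Python, no convolutions):

```python
# The walkthrough as code: the same 2-D input and 2x2 residual
# weights, with the bias at zero.

x = [1.0, 0.5]
W = [[0.1, 0.05],
     [0.05, 0.1]]

# Step 1: residual F(x) = Wx
F = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(F)  # [0.125, 0.1]

# Step 2: shortcut H(x) = F(x) + x
H = [f + xi for f, xi in zip(F, x)]
print(H)  # [1.125, 0.6]
```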

Now imagine this stacked 50 times. Each block only needs to represent a small refinement. The signal accumulates cleanly. The gradients flow cleanly. The optimizer never has to hold the entire representation in mind — just the incremental delta.

What’s clever about it

The degradation problem revealed something subtle: plain networks can’t learn identity mappings in practice, even when that’s the theoretically optimal behavior.

“If the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart.”

This sounds obvious. It isn’t. The issue is optimization geometry. Stacked ReLUs and convolutions are not symmetric around identity — there’s no natural initialization that puts them near “do nothing.” Residual connections fix this by making identity the default: if the weights stay near zero, the block passes input through unchanged. The network starts from a good place.
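That default can be demonstrated directly (a simplification: with ReLU, exact pass-through holds only for non-negative inputs):

```python
# With residual weights at zero, a block passes its input through
# unchanged -- even stacked many times deep.

def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(W, x):
    f = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # F(x) = Wx
    return relu([fi + xi for fi, xi in zip(f, x)])           # F(x) + x

x = [1.0, 0.5]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Stack 50 zero-initialized residual blocks: the input survives intact.
out = x
for _ in range(50):
    out = residual_block(zeros, out)
print(out)  # [1.0, 0.5]
```

A plain stack with zero weights would instead collapse every input to zero, which is the worst possible starting point.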

What makes this remarkable in retrospect is how little it costs. No new architecture components. No new training procedure. No new regularization. One addition operation per block, and suddenly 152-layer networks train stably.

“Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets.”

Translation: more depth, fewer parameters than the previous dominant model. Depth became free.

Results: does it hold up?

Model                 Depth        Top-5 Error (ImageNet val)
VGG-19 (plain)        19 layers    7.32%
34-layer plain        34 layers    7.53%
ResNet-34             34 layers    5.71%
ResNet-152            152 layers   4.49%
Ensemble of ResNets                3.57%

The comparison that matters: plain-34 vs ResNet-34. Same depth, same parameter budget, same training setup. Plain: 7.53%. Residual: 5.71%. That gap is the shortcut connection.

The 3.57% ensemble result won ILSVRC 2015 classification. The same ResNets won detection, localization, COCO detection, and COCO segmentation. Five competitions. One architectural change.

What doesn’t work or scale simply:

When input and output dimensions differ (say, going from 64 to 128 channels), the identity shortcut can’t be added directly — you need a 1×1 projection convolution to match dimensions (the paper’s “option B” projection shortcut). This adds parameters. It works, but the shortcut is no longer free.
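A sketch of the projection shortcut idea with assumed shapes (a learned matrix stands in for the 1×1 convolution; the weights here are made up):

```python
# When F(x) changes the dimension, a learned projection W_s maps x
# to the new size so the addition type-checks. Weights are illustrative.

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

x = [1.0, 0.5]                     # 2 "channels" in
W_f = [[0.1, 0.0],                 # residual path: 2 -> 4 channels
       [0.0, 0.1],
       [0.1, 0.1],
       [0.2, 0.0]]
W_s = [[1.0, 0.0],                 # projection shortcut: 2 -> 4
       [0.0, 1.0],
       [1.0, 1.0],
       [1.0, 0.0]]

F = matvec(W_f, x)                 # F(x), now 4-dimensional
shortcut = matvec(W_s, x)          # W_s x replaces the free identity
H = [f + s for f, s in zip(F, shortcut)]
print(H)
```

The shortcut now carries parameters (W_s), which is exactly the cost the paper flags.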

The paper also acknowledges that the theoretical explanation is incomplete. Why exactly does residual learning avoid degradation? The vanishing gradient story is compelling but doesn’t fully account for it. Later work showed that residual networks also create ensemble-like behavior (many implicit shorter paths through the network), but that understanding came after.

Very wide networks benefit less. The ResNet architecture optimizes for depth. Width-focused variants like Wide ResNet (Zagoruyko & Komodakis, 2016) showed you can trade depth for width while matching or exceeding performance with fewer layers.

If you’re building ML systems

ResNet-50 is still a legitimate baseline for image classification and feature extraction in 2024. Not the best available, but battle-tested, well-understood, and available pretrained in every framework. If you’re uncertain which backbone to start with for a vision task, start here and see what you’re actually comparing against before reaching for something heavier.

The deeper lesson is about skip connections as a pattern, not a model. Every transformer block uses them. The “Add & Norm” step in attention is exactly F(x) + x with a normalization. Mamba uses them. UNets for diffusion models are built around them. They migrated out of CNNs into nearly everything because the core insight generalizes: compute a delta, add it to the input, let gradients flow freely.
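The transformer’s “Add & Norm” step is the same pattern, sketched here with a hand-rolled layer norm (a post-norm arrangement; the sublayer output is made up, and no real library’s API is implied):

```python
# "Add & Norm" as a sketch: residual add, then layer normalization.
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer_out):
    # LayerNorm(x + F(x)): the same shortcut, plus normalization.
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer_out)])

x = [1.0, 0.5, -0.5, 0.0]
f_x = [0.1, -0.1, 0.2, 0.0]   # pretend attention/MLP output (made up)
print(add_and_norm(x, f_x))
```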

If you’re designing a new deep architecture and not including skip connections, you need a very good reason. The default is to include them. The burden of proof is on leaving them out.

When studying ViT and wondering why transformers work at scale for vision — part of the answer is residual connections. Every transformer layer is a residual block. ResNet proved those are reliable at depth. ViT inherited that stability for free.

Residual connections: the addition that made depth free.

Paper: Deep Residual Learning for Image Recognition — Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun — 2015


Citation

arXiv:1512.03385

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.