Concepts: optimization | stochastic gradient descent | adaptive learning rate | momentum
Builds on: plain SGD (no explainer yet) | Grokking
Leads to: LoRA — which uses Adam to fine-tune only low-rank matrices
Part 1: The problem with one learning rate
Vanilla SGD gives every parameter in your model the same learning rate. Every single one. The embedding for the word “the” (which appears in nearly every sentence) gets the same step size as the embedding for “xylophone” (which appears almost never). That is not a sensible arrangement.
Some parameters receive gradient updates constantly, from almost every batch. Others receive updates rarely, only when the relevant input pattern appears. A flat learning rate is too large for the frequent parameters — it overshoots and oscillates — and too small for the rare ones — they barely move. It is the wrong shoe size for almost every foot in the room.
The question Adam answers: can we give each parameter its own adaptive step size, derived automatically from the gradient history?
Part 2: How Adam works
The hiker analogy
Imagine a hiker navigating a foggy mountain, trying to find the lowest valley. The hiker cannot see more than a few feet ahead. But the hiker is smart:
- They remember the general direction they have been moving (not just the last step, but a weighted average of recent steps). This is momentum.
- They track how consistent versus bumpy the terrain has been in each direction. A path that has consistently sloped downward in the east direction is reliable. A path that alternates violently up and down in the north direction is noisy — trust it less.
- They take bigger steps in smooth, consistent directions. Smaller steps in noisy, uncertain directions.
That is Adam. Step by step.
Mechanism in plain English
- Compute the gradient g_t at the current parameter values. This is the slope at your current position.
- Update the first moment (momentum): m_t = β₁ · m_{t-1} + (1-β₁) · g_t. This smooths the gradient signal by keeping a running average of where you have been heading.
- Update the second moment (variance): v_t = β₂ · v_{t-1} + (1-β₂) · g_t². This tracks how large the gradients have typically been — the magnitude, not just the direction.
- Apply bias correction: m̂ = m_t / (1-β₁^t), v̂ = v_t / (1-β₂^t). Critical for early training — explained below.
- Update: θ = θ - α · m̂ / (√v̂ + ε). Step in the smoothed direction, scaled by how reliable that direction has been.
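The five steps above translate almost line-for-line into code. Here is a minimal sketch in plain Python, with parameters as a list of scalars and names chosen for illustration (this is not the paper's pseudocode verbatim):

```python
import math

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over a list of scalar parameters (updates lists in place)."""
    for i in range(len(theta)):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]        # first moment: smoothed direction
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2   # second moment: typical magnitude
        m_hat = m[i] / (1 - beta1 ** t)                 # bias correction (t counts from 1)
        v_hat = v[i] / (1 - beta2 ** t)
        theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Calling this once from a zero state with gradients [0.1, -0.5, 0.3] reproduces the numeric walkthrough later in this piece.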
The ASCII picture
```
Plain SGD:   gradient → fixed step size → update
  [steep]     [α=0.001]   [-0.001 * gradient]
  [gentle]    [α=0.001]   [-0.001 * gradient]   ← same! Wrong.

Adam:
  gradient → first moment (smooth) ─┐
           → second moment (scale) ─┤→ adaptive step → update
           → bias correction       ─┘
  [steep, consistent] → bigger step
  [steep, noisy]      → smaller step   ← adapts!
```
The math, with every symbol translated
m_t = β₁ · m_{t-1} + (1-β₁) · g_t
m_t is the running average of recent gradients — the direction the hiker has been heading. β₁ = 0.9 means the history gets 9x more weight than the new gradient. The signal stays smooth, not jumpy.
v_t = β₂ · v_{t-1} + (1-β₂) · g_t²
v_t is the running average of squared gradients — how large and consistent the gradient signals have been. β₂ = 0.999 means the variance estimate changes very slowly. A parameter that consistently receives large gradients will have a large v_t; one that receives small or infrequent gradients will have a small v_t.
m̂_t = m_t / (1 - β₁^t), v̂_t = v_t / (1 - β₂^t)
Both m and v start at zero. After step 1, m₁ = 0.1 · g_1 — not g_1, not a real average, just a tenth of g_1, pulled toward zero by the zero initialization. The correction term 1/(1-β^t) fixes this exactly. At t=1 with β₁=0.9, the correction is 1/(1-0.9) = 10. At t=100, it is 1/(1-0.9^100) ≈ 1.00003. The correction fades to nothing as training progresses. This is the underappreciated trick that makes Adam work well from step 1.
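Those endpoint values are easy to verify directly (plain Python, β₁ = 0.9 as in the text):

```python
beta1 = 0.9
for t in [1, 10, 100]:
    correction = 1 / (1 - beta1 ** t)   # bias-correction factor for the first moment
    print(t, correction)                # 10.0 at t=1, ~1.54 at t=10, ~1.00003 at t=100
```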
The paper puts it directly: “These bias corrections counteract the initialization bias.”
θ_{t+1} = θ_t - α · m̂_t / (√v̂_t + ε)
Step in the smoothed gradient direction m̂, but scale the step down by how large the gradients have typically been (√v̂). Parameters with consistently large gradients get smaller effective step sizes. Parameters with consistently small or rare gradients get larger effective step sizes. ε = 1e-8 prevents division by zero.
Four direct quotes from the paper
The paper describes the algorithm’s properties:
“The method is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters.”
Translation: Adam does not need to store a full matrix of second-order information. It stores just two extra vectors (m and v) the same size as the parameters. That is the “little memory” claim — modest compared to methods like L-BFGS that store curvature approximations.
“The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.”
Translation: When gradients are sparse (most entries zero, occasional large signal), adaptive methods outperform SGD dramatically. Word embeddings are the canonical example — “the” updates constantly, “xylophone” updates rarely. Adam handles both.
“The hyper-parameters have intuitive interpretations and typically require little tuning.”
Translation: β₁ = 0.9, β₂ = 0.999, ε = 1e-8, α = 0.001 are the defaults, and they work well across a remarkable range of problems. You tune α; the others rarely need adjustment.
“Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods.”
Translation: The paper tested on logistic regression, multilayer networks, convolutional networks, and language models. Adam matched or beat Adagrad, RMSProp, and SGD+momentum across the board in convergence speed.
What is clever about bias correction
Both m and v start at zero. Without correction, the first few update steps are garbage — everything is pulled toward zero because the EMA (exponential moving average) has not had time to accumulate signal. The correction term 1/(1-β^t) is exactly right to undo this. At t=1, it equals 1/(1-β). As t → ∞, β^t → 0 and the correction → 1. It is a mathematically clean, analytically exact fix, not an approximation.
The insight is recognizing that initialized-to-zero EMAs are biased in a predictable way — one you can correct analytically. Most practitioners skip over this step; the paper treats it as a first-class contribution.
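A three-line experiment makes the bias and its exact cancellation visible: feed a constant gradient g into the EMA. The raw m_t equals (1 - β₁^t) · g, still pulled toward zero, while the corrected estimate recovers g at every step (up to floating-point rounding):

```python
beta1, g = 0.9, 1.0
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g    # raw EMA, initialized at zero
    m_hat = m / (1 - beta1 ** t)       # bias-corrected estimate
    print(t, round(m, 4), m_hat)       # m creeps up: 0.1, 0.19, 0.271, ...; m_hat stays at 1.0
```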
Numeric walkthrough
Setup: 3 parameters, β₁=0.9, β₂=0.999, α=0.001, ε=1e-8
Initial: m₀=[0, 0, 0], v₀=[0, 0, 0]
Step t=1, gradients: g₁=[0.1, -0.5, 0.3]
First moment:
m₁ = 0.9·[0,0,0] + 0.1·[0.1, -0.5, 0.3]
= [0.01, -0.05, 0.03]
Second moment:
v₁ = 0.999·[0,0,0] + 0.001·[0.01, 0.25, 0.09]
= [0.00001, 0.00025, 0.00009]
Bias correction (t=1, so 1-β^1 = 1-β):
m̂₁ = [0.01, -0.05, 0.03] / (1 - 0.9) = [0.1, -0.5, 0.3]
v̂₁ = [0.00001, 0.00025, 0.00009] / (1 - 0.999) = [0.01, 0.25, 0.09]
Update step:
√v̂₁ = [0.1, 0.5, 0.3]
step = 0.001 · [0.1/0.1, -0.5/0.5, 0.3/0.3]
= 0.001 · [1.0, -1.0, 1.0]
= [0.001, -0.001, 0.001]
Note: all three parameters get the SAME step size (0.001) on step 1 — Adam normalizes the gradient direction on the first step. The variance starts making a difference from step 2 onward as v accumulates history.
The normalization on step 1 is not a flaw — it is correct behavior given zero history. From step 2 onward, parameters that have seen large gradients accumulate larger v values and get smaller effective step sizes. The adaptive behavior kicks in as history builds.
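To watch that kick in, compare two hypothetical scalar parameters over 20 steps: one receives the same gradient every step, the other an alternating-sign gradient of the same magnitude. The gradient values here are made up for illustration; the point is the ratio of effective step sizes:

```python
import math

def last_step_size(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter; return |step| taken at the final iteration."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return abs(alpha * m_hat / (math.sqrt(v_hat) + eps))

consistent = last_step_size([0.3] * 20)        # steady downhill: full-size step (~alpha)
noisy = last_step_size([0.3, -0.3] * 10)       # sign flips every step: heavily damped
print(consistent, noisy)
```

The consistent stream ends up with an effective step of about α = 0.001; the alternating stream's step is more than an order of magnitude smaller — exactly the behavior the hiker analogy promises.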
Part 3: Results and what breaks
What the paper reports
| Benchmark | Adam result | Comparison |
|---|---|---|
| MNIST logistic regression | Faster convergence than Adagrad, RMSProp, SGDm | Best convergence rate in first 200 epochs |
| MNIST multilayer neural net | Matched RMSProp, outperformed Adagrad and SGD | Stable across learning rates |
| CIFAR-10 convolutional network | Competitive with SGD+momentum | Close to best test accuracy |
The paper also tests on a character-level language model (on a Wikipedia corpus), where Adam converges faster than Adagrad and SGD+momentum.
What does not work
Adam has well-documented failure modes, most identified after the 2015 paper:
Adam can overfit more than SGD on image classification. The generalization gap is real and reproducible: Adam often converges faster but to sharper minima that generalize slightly worse on held-out data. On ImageNet with a ResNet, carefully tuned SGD+momentum typically beats Adam on final test accuracy by 1-3%.
Adam stores two extra vectors, m and v, each the size of the full parameter set, on top of the gradients themselves — optimizer state that vanilla SGD does not carry at all. For a 70B-parameter model that is hundreds of gigabytes of extra state (two fp32 vectors per parameter comes to roughly 560 GB).
The paper’s benchmarks are relatively small by today’s standards. The advice for extreme scale — hundreds of billions of parameters trained on trillions of tokens — is less clear from the original experiments.
The default ε = 1e-8 can cause issues in some settings (very sparse gradients, very small batches). Some practitioners increase ε to 1e-7 or 1e-6 to improve stability.
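The effect is easiest to see at the scale of a single step. In the bias-corrected update from a fresh state, m̂ = g and √v̂ = |g|, so ε only matters once |g| approaches it. A sketch with hypothetical gradient values chosen to illustrate the scales:

```python
import math

def first_step(g, eps, alpha=0.001):
    # One Adam step from zero state: bias correction gives m_hat = g, v_hat = g**2
    return alpha * g / (math.sqrt(g * g) + eps)

print(first_step(1e-3, 1e-8))   # ~0.001: eps is negligible next to sqrt(v_hat)
print(first_step(1e-7, 1e-8))   # mildly damped
print(first_step(1e-7, 1e-6))   # ~10x smaller: a larger eps tames tiny-gradient steps
```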
Part 4: So what? Connections and consequences
If you are building ML systems, this matters
Use Adam as the default for transformers, NLP, and reinforcement learning. It requires less learning rate tuning than SGD and handles the sparse gradient case (common in embeddings) gracefully.
Consider SGD with momentum for image classification if you have time to tune a learning rate schedule. With the right warmup and cosine decay, SGD+momentum can match or beat Adam on final test accuracy for vision tasks. The tradeoff is effort: Adam works well out of the box.
For fine-tuning — LoRA, full fine-tuning, instruction tuning — Adam is almost always the right choice. You are not training from scratch; you are making targeted adjustments to an already-trained model, and the adaptive step sizes help enormously when different layers are at different stages of convergence.
The bias correction matters most in the first 100-500 steps. If your loss is spiky early in training, check whether your optimizer implementation applies bias correction correctly. Some early implementations skipped it.
Connections to other work
Grokking shows that optimizer dynamics directly cause the delayed generalization phenomenon. The interplay between the optimizer’s momentum terms and weight decay determines whether and when a network transitions from memorization to generalization. Slower optimizers with more regularization favor grokking. Adam’s fast convergence can actually work against grokking — it reaches the memorizing solution quickly and the correction happens later.
LoRA uses Adam on a tiny fraction of parameters — the injected low-rank matrices A and B — while freezing everything else. Adam’s adaptive step sizes are especially valuable here: the low-rank matrices start at near-zero (B is initialized to zero, A to random), and Adam’s bias correction helps the early steps when gradient history is absent.
InstructGPT and modern RLHF pipelines depend on Adam throughout. The reward model training, supervised fine-tuning stage, and PPO policy gradient updates all use Adam. The adaptive step sizes matter for the reward model in particular, where different preference signals arrive at different frequencies.
One-liner
Adam: give each parameter its own learning rate, derived from its own gradient history — the closest thing to a free lunch in deep learning optimization.
Citation
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR).
Connections
- Grokking — optimizer dynamics directly affect delayed generalization
- LoRA — uses Adam to fine-tune low-rank matrices efficiently
- optimization
- stochastic-gradient-descent
- adaptive-learning-rate
- momentum
- bias-correction