Per-parameter learning rates that adjust based on the gradient history — larger steps for rare/small gradients, smaller steps for frequent/large ones.

A flat global learning rate treats all parameters identically. But parameters differ: some receive dense gradient updates (e.g. embeddings for common words), others receive sparse ones (e.g. embeddings for rare words). A rate that works for the frequent parameters is too small for the rare ones, and vice versa.

Adaptive methods solve this by maintaining a per-parameter scaling factor derived from gradient history. Adagrad accumulates all squared gradients (leading to monotonically shrinking steps). Adam uses an exponential moving average of squared gradients (v_t), which decays old information and keeps the effective learning rate bounded.
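The contrast between the two accumulators can be sketched numerically (a minimal NumPy illustration, not a production optimizer; the function names are mine, and both return the per-parameter scale factor that multiplies the base learning rate):

```python
import numpy as np

def adagrad_scales(grads, eps=1e-8):
    # Adagrad: accumulate ALL squared gradients. The denominator only
    # grows, so the effective step size shrinks monotonically.
    accum = np.zeros_like(grads[0])
    scales = []
    for g in grads:
        accum += g ** 2
        scales.append(1.0 / (np.sqrt(accum) + eps))
    return scales

def adam_scales(grads, beta2=0.999, eps=1e-8):
    # Adam: exponential moving average of squared gradients (v_t).
    # Old gradients decay, so the effective step stays bounded.
    v = np.zeros_like(grads[0])
    scales = []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = v / (1 - beta2 ** t)  # bias correction for the zero init
        scales.append(1.0 / (np.sqrt(v_hat) + eps))
    return scales

# A constant gradient of 1.0 for 1000 steps: Adagrad's scale keeps
# shrinking toward zero, while Adam's settles near 1 / |g| = 1.0.
grads = [np.array([1.0])] * 1000
ada = adagrad_scales(grads)
adam = adam_scales(grads)
```

Running this, `ada[-1]` is roughly `1/sqrt(1000) ≈ 0.03` and still falling, while `adam[-1]` stays near `1.0` — the bounded-step behavior the EMA buys.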

Key Sources