What It Is
A machine learning paradigm where an agent learns by interacting with an environment — taking actions, receiving scalar reward signals, and updating its policy to maximize cumulative reward over time. No labeled data; only the reward signal guides learning.
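The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not any specific library's API: `ChanceEnv` and `RandomAgent` are hypothetical stand-ins for a real environment and a real learning agent.

```python
import random

class ChanceEnv:
    """Hypothetical toy environment: episode ends after 3 steps, reward 1 per step."""
    def reset(self):
        self.t = 0
        return 0  # initial state

    def step(self, action):
        self.t += 1
        # returns (next_state, scalar reward, episode-done flag)
        return self.t, 1.0, self.t >= 3

class RandomAgent:
    """Placeholder agent: uniform-random policy, no-op learning update."""
    def act(self, state):
        return random.choice([0, 1])

    def update(self, state, reward):
        pass  # a real agent would adjust its policy here

def run_episode(env, agent):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = agent.act(state)               # policy selects an action
        state, reward, done = env.step(action)  # environment responds
        agent.update(state, reward)             # only the scalar reward guides learning
        total += reward
    return total

print(run_episode(ChanceEnv(), RandomAgent()))  # → 3.0
```

Every RL algorithm, from tabular Q-learning to RLHF, is a refinement of what happens inside `agent.update`.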
Why It Matters
RL is how language models are trained to follow human preferences (RLHF), how AlphaGo learned to play Go, and how robots learn to walk. It’s the bridge between supervised learning (which requires labeled examples) and open-ended optimization (which requires only a reward function).
How It Works
The agent observes a state sₜ, selects an action aₜ according to its policy π(a|s), receives reward rₜ₊₁, and transitions to state sₜ₊₁. The goal is to find the policy π* that maximizes the expected discounted return E[Σₜ γᵗ rₜ₊₁], where γ ∈ [0, 1) is the discount factor. Policy gradient methods like PPO optimize π directly by ascending the gradient of expected reward. Value-based methods like Q-learning learn a value function V(s) or Q(s, a) and derive a policy from it.