What It Is
A machine learning paradigm where an agent learns by interacting with an environment — taking actions, receiving scalar reward signals, and updating its policy to maximize cumulative reward over time. No labeled data; only the reward signal guides learning.
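The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not any specific library's API: `ChanceEnv` and `RandomAgent` are hypothetical stand-ins for a real environment and a real learning agent.

```python
import random

class ChanceEnv:
    """Hypothetical toy environment: episode ends after 3 steps, reward 1 per step."""
    def reset(self):
        self.t = 0
        return 0  # initial state

    def step(self, action):
        self.t += 1
        # returns (next_state, scalar reward, episode-done flag)
        return self.t, 1.0, self.t >= 3

class RandomAgent:
    """Placeholder agent: uniform-random policy, no-op learning update."""
    def act(self, state):
        return random.choice([0, 1])

    def update(self, state, reward):
        pass  # a real agent would adjust its policy here

def run_episode(env, agent):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = agent.act(state)               # policy selects an action
        state, reward, done = env.step(action)  # environment responds
        agent.update(state, reward)             # only the scalar reward guides learning
        total += reward
    return total

print(run_episode(ChanceEnv(), RandomAgent()))  # → 3.0
```

Every RL algorithm, from tabular Q-learning to RLHF, is a refinement of what happens inside `agent.update`.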
Why It Matters
RL is how language models are trained to follow human preferences (RLHF), how AlphaGo learned to play Go, and how robots learn to walk. It’s the bridge between supervised learning (which requires labeled examples) and open-ended optimization (which requires only a reward function).
How It Works
The agent observes a state sₜ, selects an action aₜ according to its policy π(a|s), receives reward rₜ₊₁, and transitions to state sₜ₊₁. The goal is to find the policy π* that maximizes the expected discounted return E[Σₜ γᵗ rₜ₊₁], where γ ∈ [0, 1) is the discount factor. Policy gradient methods like PPO optimize π directly by ascending the gradient of expected reward. Value-based methods like Q-learning learn a value function V(s) or Q(s, a) and derive a policy from it.