What It Is

AI feedback (or RLAIF — Reinforcement Learning from AI Feedback) is the practice of using a language model to generate preference labels for training, replacing or supplementing human raters. Instead of asking a human “which response is better?”, you ask a capable model the same question, optionally with a guiding principle or chain-of-thought prompt.
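The comparison question can be posed as a simple templated prompt. A minimal sketch, assuming a completion-style feedback model that answers with a single choice token; the template and function name are illustrative, not taken from any specific paper:

```python
# Hypothetical helper: builds an AI-feedback comparison prompt.
# The principle text and template wording are illustrative assumptions.
def build_comparison_prompt(prompt: str, response_a: str, response_b: str,
                            principle: str) -> str:
    """Assemble a 'which response is better?' query for a feedback model.

    The prompt ends mid-sentence at "(" so the model's next token
    ("A" or "B") directly expresses its preference.
    """
    return (
        f"Consider the following conversation:\n\n"
        f"Human: {prompt}\n\n"
        f"{principle}\n\n"
        f"Option (A): {response_a}\n"
        f"Option (B): {response_b}\n\n"
        "The answer is: ("
    )
```

Scoring the model's next-token distribution over "A" versus "B" at the end of this prompt yields the preference label.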

Why It Matters

Human preference labeling is the bottleneck in RLHF pipelines — expensive, slow, and harmful to raters when the content involves violence, exploitation, or abuse. AI feedback removes that bottleneck for the harmlessness component specifically. The key empirical finding from Constitutional AI: AI-generated preference labels produce comparable or better harmlessness training signal than human labels, when the model is prompted with explicit principles.

How It Works

Present a feedback model with two responses to a prompt, along with a principle for evaluation. Compute the log probability the model assigns to each choice and use these as soft preference targets. Train a preference model (PM) on the resulting comparisons, then feed that PM into a standard RLHF pipeline (e.g., PPO). Chain-of-thought prompting before the preference judgment improves label quality, but the resulting probabilities must be clamped to the 40-60% range to avoid overconfident labels that destabilize training.
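The soft-label step above can be sketched in a few lines. This is a minimal illustration, assuming the feedback model exposes log probabilities for the two choice tokens; the function name and the exact clamp bounds are assumptions for the sketch:

```python
import math

def soft_preference(logp_a: float, logp_b: float,
                    clamp: tuple = (0.4, 0.6)) -> float:
    """Return P(A preferred) as a soft training target.

    logp_a, logp_b: log probabilities the feedback model assigns to
    choosing response A or B. The pair is renormalized so the two
    probabilities sum to 1, then clamped (here to 40-60%) to keep
    chain-of-thought judgments from producing overconfident labels.
    """
    p_a = math.exp(logp_a)
    p_b = math.exp(logp_b)
    p_a_norm = p_a / (p_a + p_b)  # renormalize over the two choices
    lo, hi = clamp
    return min(max(p_a_norm, lo), hi)
```

With equal log probabilities the label is 0.5; a strongly lopsided judgment (say 90/10) is clamped to 0.6, so the downstream preference model never sees a fully saturated target.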

Key Sources