What It Is

AI feedback (or RLAIF — Reinforcement Learning from AI Feedback) is the practice of using a language model to generate preference labels for training, replacing or supplementing human raters. Instead of asking a human “which response is better?”, you ask a capable model the same question, optionally with a guiding principle or chain-of-thought prompt.
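The comparison question can be posed as a simple templated prompt. A minimal sketch, assuming a completion-style feedback model that answers with a single choice token; the template and function name are illustrative, not taken from any specific paper:

```python
# Hypothetical helper: builds an AI-feedback comparison prompt.
# The principle text and template wording are illustrative assumptions.
def build_comparison_prompt(prompt: str, response_a: str, response_b: str,
                            principle: str) -> str:
    """Assemble a 'which response is better?' query for a feedback model.

    The prompt ends mid-sentence at "(" so the model's next token
    ("A" or "B") directly expresses its preference.
    """
    return (
        f"Consider the following conversation:\n\n"
        f"Human: {prompt}\n\n"
        f"{principle}\n\n"
        f"Option (A): {response_a}\n"
        f"Option (B): {response_b}\n\n"
        "The answer is: ("
    )
```

Scoring the model's next-token distribution over "A" versus "B" at the end of this prompt yields the preference label.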

Why It Matters

Human preference labeling is the bottleneck in RLHF pipelines — expensive, slow, and harmful to raters when the content involves violence, exploitation, or abuse. AI feedback removes that bottleneck for the harmlessness component specifically. The key empirical finding from Constitutional AI: AI-generated preference labels produce comparable or better harmlessness training signal than human labels, when the model is prompted with explicit principles.

How It Works

Present a feedback model with two responses to a prompt, along with a principle for evaluation. Compute the log probability the model assigns to each choice and use these as soft preference targets. Train a preference model (PM) on the resulting comparisons, then feed that PM into a standard RLHF pipeline (e.g., PPO). Chain-of-thought prompting before the preference judgment improves label quality, but the resulting probabilities must be clamped to the 40-60% range to avoid overconfident labels that destabilize training.
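The soft-label step above can be sketched in a few lines. This is a minimal illustration, assuming the feedback model exposes log probabilities for the two choice tokens; the function name and the exact clamp bounds are assumptions for the sketch:

```python
import math

def soft_preference(logp_a: float, logp_b: float,
                    clamp: tuple = (0.4, 0.6)) -> float:
    """Return P(A preferred) as a soft training target.

    logp_a, logp_b: log probabilities the feedback model assigns to
    choosing response A or B. The pair is renormalized so the two
    probabilities sum to 1, then clamped (here to 40-60%) to keep
    chain-of-thought judgments from producing overconfident labels.
    """
    p_a = math.exp(logp_a)
    p_b = math.exp(logp_b)
    p_a_norm = p_a / (p_a + p_b)  # renormalize over the two choices
    lo, hi = clamp
    return min(max(p_a_norm, lo), hi)
```

With equal log probabilities the label is 0.5; a strongly lopsided judgment (say 90/10) is clamped to 0.6, so the downstream preference model never sees a fully saturated target.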

Key Sources