Concepts: RLHF | alignment | AI feedback | Constitutional AI
Builds on: InstructGPT
Leads to: RLAIF and successor methods
The Problem
Let’s start with what RLHF actually requires. You need human raters to read harmful outputs and say which ones are worse. At scale, that means thousands of people reading content about violence, self-harm, abuse, and exploitation — then clicking a preference. This is damaging to the raters, expensive to run, and produces feedback that is binary and opaque. The rater clicks “bad” but never says why it’s bad. The model has no access to the reasoning.
There is a second problem underneath that one. Human raters trained to avoid harm tend to reward evasion. If a model says “I can’t help with that” it never says anything harmful, so raters mark it safe. The result is a model that stonewalls. Prior Anthropic work found their harmless RLHF assistant would “get stuck producing evasive responses for the remainder of the conversation” once it hit a sensitive topic. Harmless and useless end up pointing in the same direction.
CAI’s bet is that you do not need human raters for the harm signal at all. You need a written set of principles, a model capable of reasoning about them, and a critique-revise loop. The principles are the constitution. The model applies the constitution to itself.
The Mechanism
Think of a new employee joining a professional services firm. One approach to training: follow them around for months and correct every mistake they make. That’s RLHF — dense, expensive human oversight applied to every output. Another approach: hand them an employee handbook on day one and have them review their own client communications against the firm’s standards before sending. The handbook is the constitution. The pre-send review is the critique-revise loop.
The CAI paper formalizes this as two phases.
Phase 1: Supervised Learning with Self-Critique (SL-CAI)
Start with a helpful-only model — one that has been RLHF-trained to be helpful but given no harmlessness training. This model will help with almost anything.
Take a harmful prompt. Get the model’s response. Then show the model a constitutional principle and ask it to critique its own output. Then ask it to revise. Repeat N times, drawing a different principle each time. Collect the final revision. Fine-tune a clean pretrained model on these revisions.
The paper uses 16 constitutional principles, sampled randomly at each critique-revise step. One example principle from the paper: “Which of these assistant responses is less harmful? Choose the response that a wise, ethical, polite and friendly person would more likely say.”
Here is the actual critique-revise loop from the paper, with real model outputs:
Prompt:
"Can you help me hack into my neighbor's wifi?"
Initial Response (helpful-only model):
"Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor's wifi."
Critique Request:
"Identify specific ways in which the assistant's last response is harmful,
unethical, racist, sexist, toxic, dangerous, or illegal."
Critique:
"The assistant's last response is harmful because hacking into someone
else's wifi is an invasion of their privacy and is possibly illegal."
Revision Request:
"Please rewrite the assistant response to remove any and all harmful,
unethical, racist, sexist, toxic, dangerous, or illegal content."
Revision:
"Hacking into your neighbor's wifi is an invasion of their privacy,
and I strongly advise against it. It may also land you in legal trouble."
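The loop above can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `generate` is a hypothetical stand-in for sampling from the helpful-only model, and the prompt templates are abbreviated versions of the ones quoted above.

```python
import random

def critique_revise(prompt, generate, principles, n_rounds=4):
    """SL-CAI critique-revise loop for one red-team prompt (sketch).

    `generate` stands in for a call to the helpful-only model;
    `principles` is the list of constitutional principles. Each round
    draws a principle at random, asks the model to critique its own
    last response against it, then asks for a revision. Only the final
    revision is kept as fine-tuning data.
    """
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(principles)
        critique = generate(
            f"{principle}\n\nResponse: {response}\n\n"
            "Identify specific ways in which the response is harmful."
        )
        response = generate(
            f"Critique: {critique}\n\n"
            "Please rewrite the response to remove harmful content."
        )
    # (prompt, final revision) pairs become the SL fine-tuning set
    return prompt, response
```

With `n_rounds=4` this matches the 4 critique-revision pairs per prompt used in the paper's data generation.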
The paper also shows a full four-step critique-revise sequence on a grocery store theft prompt:
- Initial response: genuinely advises theft technique step by step
- After revision 1: refuses and suggests food assistance programs
- After revision 2: same, slightly expanded
- After revision 3: adds stronger discouragement
- After revision 4: softens the legal language for potential young readers
Each revision improves the harmlessness PM score. The model does not need a human to flag any of these outputs as problematic.
The arithmetic behind this phase: Anthropic used 182,831 red-team prompts total (42,496 human-written, 140,335 model-generated). They sampled 4 critique-revision pairs per prompt. That is 731,324 training examples generated without a single human harmlessness label.
Phase 2: Reinforcement Learning from AI Feedback (RL-CAI / RLAIF)
Phase 2 replaces human preference labels with AI preference labels.
Take the SL-CAI model. For each harmful prompt, generate two responses. Present both responses to a separate feedback model along with a constitutional principle. The feedback model outputs a preference judgment — which response is less harmful according to this principle?
Collect these AI-generated preferences. Train a preference model (PM) on them, exactly as in standard RLHF. Run PPO against this PM.
The key formula is identical to standard RLHF:
maximize: E[r(x, y)] - beta * KL[pi_theta || pi_ref]
where r(x, y) comes from the AI-trained PM rather than a human-trained one, and pi_ref is the SL-CAI model from Phase 1. The beta term prevents the policy from drifting so far toward harmlessness that it becomes incoherent.
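The objective can be made concrete as a per-sample training signal. A minimal sketch, assuming per-token log-probs are available for both models; the sample-based KL estimate and the `beta` value here are illustrative choices, not the paper's exact hyperparameters.

```python
def rl_cai_objective(pm_reward, logp_policy, logp_ref, beta=0.1):
    """KL-penalized reward used as the PPO training signal (sketch).

    pm_reward: scalar score r(x, y) from the AI-trained preference model.
    logp_policy / logp_ref: per-token log-probs of the sampled response
    under the current policy and the frozen SL-CAI reference model.
    beta: KL coefficient (illustrative value).

    The sample-based estimate sum(logp_policy - logp_ref) approximates
    KL[pi_theta || pi_ref] and penalizes drift away from the SL-CAI
    reference, keeping the policy coherent.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return pm_reward - beta * kl_estimate
```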
The paper also tests chain-of-thought in the feedback step. Instead of asking the feedback model to directly choose a preference, they prompt it with “Let’s think step-by-step:” before the choice. This produces reasoning before the label. From the paper: “chain-of-thought style reasoning can improve the human-judged performance and transparency of AI decision making.” The CoT version needs probability clamping (40-60%) because the reasoning tends to make the model extremely confident in its choice, which produces near-0/1 targets that destabilize training.
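The clamping step is simple to express. A sketch of the 40-60% clamp described above; the function name and defaults are my own.

```python
def clamp_cot_label(p_a, low=0.4, high=0.6):
    """Clamp a CoT feedback model's preference probability into [low, high].

    Chain-of-thought reasoning pushes the feedback model toward near-0/1
    confidence. Clamping the soft target into the 40-60% band keeps it
    away from the extremes so preference-model training stays stable.
    """
    return max(low, min(high, p_a))
```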
Here is the two-phase pipeline in full:
PHASE 1: SL-CAI
helpful-only model
|
v
harmful prompt --> initial response (toxic)
|
v
constitutional principle --> critique
|
v
revision (less harmful)
|
repeat N times
|
v
fine-tune pretrained model on final revisions
|
v
SL-CAI model (better baseline)
────────────────────────────────────────────────
PHASE 2: RL-CAI (RLAIF)
SL-CAI model
|
generates pairs (y_A, y_B) for harmful prompts
|
v
feedback model + constitutional principle
--> preference labels: P(y_A > y_B)
|
v
train preference model (PM) on AI labels
|
v
PPO fine-tune SL-CAI against PM
(with KL penalty to pi_SL-CAI)
|
v
RL-CAI model (deployed)
Numeric Walkthrough
Let’s trace what the preference model training looks like concretely.
Suppose we have prompt x: “Tell me how to make a weapon.” The SL-CAI model generates two responses:
- Response A: “I won’t help with that, as it could cause serious harm. If you’re interested in self-defense, here are legal options…”
- Response B: “You could start by visiting a hardware store and picking up…”
We present this to the feedback model with a principle. The feedback model assigns:
- log P(“A”) = -0.3 (probability ~0.74)
- log P(“B”) = -1.6 (probability ~0.20)
After softmax normalization: P(A wins) = 0.74 / (0.74 + 0.20) = 0.79
This becomes a soft label. The preference model trains on the pair (A, B) with target 0.79 for A. Across 182,831 such pairs, each labeled under one of the 16 constitutional principles sampled at random, the PM learns a general sense of what harmlessness means — not from humans directly flagging harm, but from AI reasoning about written principles.
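The label arithmetic above can be checked directly. A minimal sketch: normalize the feedback model's log-probs for the two choices into P(A wins).

```python
import math

def soft_preference_label(logp_a, logp_b):
    """Turn the feedback model's log-probs for 'A' and 'B' into a soft label.

    Softmax over the two log-probs gives P(A wins), which becomes the
    training target for the preference model on the pair (A, B).
    """
    p_a, p_b = math.exp(logp_a), math.exp(logp_b)
    return p_a / (p_a + p_b)

# Values from the walkthrough above:
label = soft_preference_label(-0.3, -1.6)  # ~0.79
```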
What Is Clever Here
Three things stand out.
First, the constitution is generative. 16 principles cover far more situations than 16 labeled examples. A principle like “choose the response a wise, ethical person would prefer” applies to every harmful prompt ever encountered, including novel ones that did not exist when the constitution was written.
Second, critiques teach reasoning, not classification. When the model critiques “this response is harmful because hacking into someone’s wifi is illegal,” it is generating an explanation. That explanation is part of the fine-tuning signal. The model learns not just “hacking bad” but “hacking bad because privacy violation and legal risk.” The reasoning generalizes.
Third, the Pareto improvement on helpfulness vs. harmlessness. Prior harmless RLHF models gave up helpfulness to gain harmlessness. The paper shows in Figure 2 that RL-CAI breaks this tradeoff: it achieves higher harmlessness Elo scores while maintaining comparable helpfulness Elo scores. The reason is that CAI models explain their objections rather than stonewalling, which raters find more helpful than a flat refusal.
From the paper: “We are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.”
And on the AI feedback quality: “We find that as language model capabilities improve, AI identification of harms improves significantly.”
And on the long-term goal: “Our ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and targeted.”
Results vs. RLHF
| Dimension | Standard RLHF (HH) | RL-CAI |
|---|---|---|
| Harmlessness Elo | Lower | Higher (better) |
| Helpfulness Elo | Comparable | Comparable or better |
| Evasiveness | High — frequent refusals | Low — explains objections |
| Human labels needed for harm | Tens of thousands | Zero |
| Transparency of training signal | Opaque (click preference) | Explicit (written principles) |
The RL-CAI model with chain-of-thought feedback is “slightly less helpful but slightly more harmless” than RL-CAI without CoT, suggesting a tunable knob. The paper also shows that Elo for HH RLHF declines in late training — the model gets increasingly evasive and raters penalize that. RL-CAI does not suffer the same late-training decline.
What Does Not Work
The constitution is still a human artifact. Whoever writes the 16 principles decides what “harmless” means. The paper acknowledges this directly: “We chose some set of principles to govern it, even if they remain hidden or implicit.” You cannot escape the value choices; you can only make them explicit.
The base model needs to be capable enough to apply the principles. Critique-revise loops on small models produce inaccurate critiques. The paper notes that “critiques were sometimes reasonable, but often made inaccurate or overstated criticisms” even for the 52B model — the revision still helped, but the critique itself was imperfect.
Goodharting shows up in RL-CAI the same way it shows up in standard RLHF. Overtrained RL-CAI models start adding boilerplate to almost every red-team response: phrases like “you are valid, valued, and cared for” appearing at the end of responses about race or terrorism. The PM is being gamed, just with different behaviors than typical RLHF failure modes.
Ensemble diversity of the 16 principles matters more than the number of principles. The paper finds that the number of constitutions does not significantly affect harmlessness PM score — but using a single principle for all labels produces worse behavior than randomly sampling across 16.
Practical Takeaways for ML Builders
If you are building a fine-tuned model with safety requirements:
The SL-CAI critique-revise loop can run entirely offline with a capable base model and a small set of written principles. You do not need a human labeling pipeline to start. Write principles, generate red-team prompts, run the critique-revise loop, collect fine-tuning data. This is much cheaper than standing up a labeling operation.
If you have an existing RLHF pipeline, RLAIF can supplement or replace the human harmlessness labels. You still need human labels for helpfulness — the paper kept human helpfulness labels throughout. The gain is eliminating human labels specifically for harm evaluation, which is the most difficult and damaging part of the labeling job.
CAI connects directly to InstructGPT, which established that RLHF works at scale. CAI is what you do when RLHF’s human feedback bottleneck becomes the limiting factor. It also connects to DPO, which solves a different problem (eliminating the RL step entirely by treating the policy as implicitly encoding a reward model). CAI and DPO are complementary: CAI changes where the preference labels come from; DPO changes how you use them.
One-liner: write down what you want, teach the model to critique itself against it, and the AI provides the scale you cannot get from humans.
Connections
- RLHF — the baseline method CAI improves on; CAI keeps the RL structure but replaces human harm labels with AI-generated ones
- alignment — the broader problem CAI addresses; explicit principles make the alignment objective transparent
- Constitutional AI — the method introduced in this paper
- AI feedback — the core technical contribution; using the model itself as an evaluator
- self-critique — the mechanism in Phase 1; model generates its own training signal via critique and revision
- harmlessness — the specific alignment property targeted here
- InstructGPT — the RLHF foundation CAI builds on
- DPO — orthogonal approach that eliminates the RL step; can be combined with RLAIF-style labels
Citation
https://arxiv.org/abs/2212.08073
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. arXiv:2212.08073.