What It Is

Constitutional AI is Anthropic’s method for training harmless AI assistants using a written set of principles (the “constitution”) rather than human labels identifying harmful outputs. The model critiques and revises its own responses against these principles, and a separate AI-generated preference signal replaces human raters for harmlessness training.

Why It Matters

Human harmlessness labeling is expensive, requires exposing raters to harmful content at scale, and produces only implicit feedback (a click, not a reason). CAI makes the alignment objective explicit and legible — 16 written principles — and scales the feedback-generation step using the model itself. It is the technical foundation of Claude’s training pipeline.

How It Works

Two phases. Phase 1 (SL-CAI): sample a response from a helpful-only model to a harmful prompt, have the model critique that response against a constitutional principle, revise it, and repeat for several rounds. Fine-tune on the final revisions. Phase 2 (RL-CAI): use an AI feedback model to generate preference labels between pairs of responses, judged against the same constitutional principles; train a preference model on those labels; run PPO against it. The result is RLAIF (Reinforcement Learning from AI Feedback): the human harmlessness signal is replaced entirely by an AI-generated signal guided by the written constitution.
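The two phases can be sketched in code. This is a minimal illustration, not Anthropic's implementation: `generate` is a stub standing in for a real language-model call, the two principles are abbreviated paraphrases, and all function names are hypothetical. The structure — a critique-and-revise loop producing supervised fine-tuning targets, and an AI judge producing preference labels — is the part that mirrors the method.

```python
# Hypothetical sketch of the CAI pipeline. `generate` stubs a language-model
# call with canned strings so the control flow is runnable end to end.

PRINCIPLES = [
    "Choose the response that is least harmful.",
    "Choose the response least likely to assist a dangerous act.",
]

def generate(prompt: str) -> str:
    """Stub LM call: a real system would query the model here."""
    if "which is better" in prompt:
        return "B"  # canned preference verdict
    if "Critique" in prompt:
        return "The response could facilitate harm and should be refused."
    if "Revise" in prompt:
        return "I can't help with that, but here is some safe context instead."
    return "Sure, here is how to do the harmful thing..."

def sl_cai_revision(harmful_prompt: str, n_rounds: int = 2) -> str:
    """Phase 1 (SL-CAI): critique-and-revise loop.

    The final revision becomes a supervised fine-tuning target.
    """
    response = generate(harmful_prompt)
    for i in range(n_rounds):
        principle = PRINCIPLES[i % len(PRINCIPLES)]
        critique = generate(
            f"Critique this response against the principle: {principle}\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique: {critique}\n{response}"
        )
    return response

def rl_cai_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Phase 2 (RL-CAI): an AI feedback model picks the preferred response.

    These A/B labels train the preference model that PPO then optimizes against.
    """
    principle = PRINCIPLES[0]
    verdict = generate(
        f"Per '{principle}', which is better for prompt '{prompt}'?\n"
        f"A: {resp_a}\nB: {resp_b}"
    )
    return "A" if "A" in verdict else "B"
```

In the paper's setup, the feedback model reads out probabilities over the A/B answer tokens rather than parsing a text verdict, and human feedback is still used for the helpfulness objective; only the harmlessness labels come from the AI judge.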

Key Sources