What It Is
Harmlessness is one of the three core alignment properties Anthropic targets (helpfulness, honesty, harmlessness). A harmless AI declines to assist with requests that could cause physical, psychological, financial, or societal harm, and does not produce content that is toxic, abusive, or dangerous. The difficulty is that harmlessness must coexist with helpfulness: a model that refuses everything is harmless but useless.
Why It Matters
Remaining harmless under adversarial inputs is one of the hardest alignment properties to train. Red-teaming consistently finds that RLHF models will comply with harmful requests when prompted carefully. Training for harmlessness with human labels exposes raters to harmful content, and it tends to produce evasive models that refuse ambiguous requests rather than reasoning about them. This evasiveness itself trades off against helpfulness.
How It Works
Constitutional AI addresses harmlessness by having the model critique and revise its own outputs against written principles, then using AI-generated preference labels for RL. The key insight is that explaining an objection is more helpful than a flat refusal: models trained via CAI learn to engage with sensitive requests and explain their reasoning rather than stonewalling. This partially resolves the helpfulness-harmlessness tension that standard RLHF struggles with.
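The two stages described above can be sketched in code. This is a minimal illustration, not Anthropic's implementation: `generate` is a placeholder for a real language-model call, and the two-principle constitution, the prompt templates, and the toy verdict parsing are all assumptions made for the sketch.

```python
# Sketch of Constitutional AI's two stages:
#   1) critique-and-revise a draft response against written principles,
#   2) produce an AI-generated preference label for RL (RLAIF).
# `generate` is a stand-in for a real model call (an assumption).

CONSTITUTION = [
    "Identify ways the response could be harmful, unethical, or dangerous.",
    "Identify ways the response is evasive where an explanation would help.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call; returns a canned reply."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, rounds: int = 1) -> str:
    """Stage 1: draft a response, then repeatedly critique and revise
    it against each constitutional principle in turn."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in CONSTITUTION:
            critique = generate(
                f"Critique this response.\nPrinciple: {principle}\n"
                f"Prompt: {prompt}\nResponse: {response}"
            )
            response = generate(
                f"Revise the response to address the critique.\n"
                f"Critique: {critique}\nOriginal response: {response}"
            )
    return response

def ai_preference_label(prompt: str, resp_a: str, resp_b: str) -> int:
    """Stage 2 (RLAIF): ask the model which response better follows
    the constitution. The string parsing here is a toy assumption."""
    verdict = generate(
        "Which response better follows the principles? Answer A or B.\n"
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}"
    )
    return 0 if "(A)" in verdict else 1
```

In the real pipeline the preference labels from stage 2 train a preference model, which then serves as the reward signal for RL, so human raters never need to label harmful content directly.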