What It Is
Harmlessness is one of the three core alignment properties Anthropic targets (helpfulness, honesty, harmlessness). A harmless AI declines to assist with requests that could cause physical, psychological, financial, or societal harm, and does not produce content that is toxic, abusive, or dangerous. The difficulty is that harmlessness must coexist with helpfulness: a model that refuses everything is harmless but useless.
Why It Matters
Remaining harmless under adversarial inputs is one of the hardest alignment properties to train. Red-teaming consistently finds that RLHF models will comply with harmful requests when prompted carefully. Training for harmlessness with human labels exposes raters to harmful content, and it tends to produce evasive models that refuse ambiguous requests rather than reasoning about them. This evasiveness itself trades off against helpfulness.
How It Works
Constitutional AI addresses harmlessness by having the model critique and revise its own outputs against written principles, then using AI-generated preference labels for RL. The key insight is that explaining an objection is more helpful than a flat refusal: models trained via CAI learn to engage with sensitive requests and explain their reasoning rather than stonewalling. This partially resolves the helpfulness-harmlessness tension that standard RLHF struggles with.
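The two stages described above can be sketched in code. This is a minimal illustration, not Anthropic's implementation: `generate` is a placeholder for a real language-model call, and the two-principle constitution, the prompt templates, and the toy verdict parsing are all assumptions made for the sketch.

```python
# Sketch of Constitutional AI's two stages:
#   1) critique-and-revise a draft response against written principles,
#   2) produce an AI-generated preference label for RL (RLAIF).
# `generate` is a stand-in for a real model call (an assumption).

CONSTITUTION = [
    "Identify ways the response could be harmful, unethical, or dangerous.",
    "Identify ways the response is evasive where an explanation would help.",
]

def generate(prompt: str) -> str:
    """Placeholder for a language-model call; returns a canned reply."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, rounds: int = 1) -> str:
    """Stage 1: draft a response, then repeatedly critique and revise
    it against each constitutional principle in turn."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in CONSTITUTION:
            critique = generate(
                f"Critique this response.\nPrinciple: {principle}\n"
                f"Prompt: {prompt}\nResponse: {response}"
            )
            response = generate(
                f"Revise the response to address the critique.\n"
                f"Critique: {critique}\nOriginal response: {response}"
            )
    return response

def ai_preference_label(prompt: str, resp_a: str, resp_b: str) -> int:
    """Stage 2 (RLAIF): ask the model which response better follows
    the constitution. The string parsing here is a toy assumption."""
    verdict = generate(
        "Which response better follows the principles? Answer A or B.\n"
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}"
    )
    return 0 if "(A)" in verdict else 1
```

In the real pipeline the preference labels from stage 2 train a preference model, which then serves as the reward signal for RL, so human raters never need to label harmful content directly.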