What It Is
Self-critique is the technique of prompting a language model to identify flaws in its own output, then generate a revised version. In the Constitutional AI context, the model is shown its own response alongside a guiding principle, asked to identify ways the response violates that principle, and then asked to revise accordingly. The critique-revise pair is a form of chain-of-thought reasoning applied to alignment.
Why It Matters
Standard fine-tuning teaches a model what outputs to produce. Self-critique teaches the model why certain outputs are problematic. The distinction matters for generalization: a model that can articulate why hacking advice is harmful can apply that reasoning to novel harmful requests it was never explicitly trained on. The critique step also provides an explanation for the revision, making the training signal more interpretable than a binary preference label.
How It Works
Given a response, prompt the model with "Identify specific ways in which this response is harmful, unethical, toxic, or dangerous." Sample a critique. Then prompt with "Rewrite the response to remove harmful content." Sample a revision. This can be repeated N times with different constitutional principles, progressively improving harmlessness. Critique quality matters less than one might expect: even inaccurate critiques tend to produce better revisions, especially for large models.
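The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `generate` stands in for sampling from a language model (here stubbed with canned strings so the sketch runs end to end), and the two principles and the `critique_revise` helper are hypothetical names chosen for this example.

```python
def generate(prompt: str) -> str:
    # Stand-in sampler. A real implementation would call a language model;
    # this stub returns canned text so the loop below is runnable.
    if "Identify specific ways" in prompt:
        return "The response gives step-by-step instructions for an attack."
    return "I can't help with that, but here are resources on securing systems."

# Hypothetical constitutional principles; a real constitution would be longer
# and more carefully worded.
PRINCIPLES = [
    "harmful, unethical, toxic, or dangerous",
    "illegal or deceptive",
]

def critique_revise(response: str, principles=PRINCIPLES) -> str:
    """Run one critique-revise pass per principle, returning the final revision."""
    for principle in principles:
        # Step 1: sample a critique of the current response against the principle.
        critique = generate(
            f"Identify specific ways in which this response is {principle}.\n\n"
            f"Response: {response}"
        )
        # Step 2: sample a revision conditioned on the response and its critique.
        response = generate(
            f"Rewrite the response to remove the problems the critique identifies.\n\n"
            f"Response: {response}\n\nCritique: {critique}"
        )
    return response
```

Each pass conditions the revision on both the prior response and its critique, which is what lets the critique act as an interpretable intermediate reasoning step rather than a discarded scratch output.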