Concepts: rlhf | dpo | reward-model | ai-feedback | alignment
Builds on: direct-preference-optimization-your-language-model-is-secretly-a-reward-model | training-language-models-to-follow-instructions-with-human-feedback | constitutional-ai-harmlessness-from-ai-feedback
Leads to: (iterative DPO research)
The problem
Standard RLHF pipelines before this paper shared one architecture: a model that generates responses, and a completely separate reward model that judges them. The reward model is trained on human preferences, frozen, and then used as a fixed signal to train the LLM. The ceiling on alignment quality is therefore set by the reward model, which is in turn limited by its human labelers, and that ceiling never moves.
“We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal.”
If the reward model can’t improve, the LLM is training toward a fixed target. And humans are inconsistent, expensive, and, at some capability level, outpaced by the models they’re rating.
The core idea
Let’s start with an analogy. Imagine you’re preparing for a competitive essay exam. You hire a coach (the reward model) who gives you feedback. Standard approach: hire one coach, freeze them in place, keep submitting essays and taking their feedback. The problem is obvious — your coach’s judgment doesn’t improve as you get better. Once you surpass your coach, you’re training against noise.
The alternative: what if you became your own coach? You write the essay, then switch hats and grade it using a detailed rubric, then train to write better essays. As your writing improves, your grading improves too. Both abilities reinforce each other.
That’s Self-Rewarding Language Models (SRLM). The same model that generates responses also acts as the judge via LLM-as-a-Judge prompting. And crucially:
“not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself.”
Both axes improve together, across iterations. The reward model is no longer frozen — it’s the evolving LLM itself.
The mechanism, step by step:
- Start with a base LLM (here: LLaMA 2 70B)
- Fine-tune on two types of seed data:
  - IFT (Instruction Fine-Tuning): ~3,200 human-written (prompt, response) pairs from Open Assistant
  - EFT (Evaluation Fine-Tuning): ~1,630 (evaluation prompt, scored response) pairs teaching the model to act as a 0-5 judge
- This gives you M₁, a model that can both answer instructions and score responses
- Self-Instruction Creation with M₁:
  - Generate new prompts via few-shot sampling
  - For each prompt, sample N=4 candidate responses from M₁
  - Use M₁ itself to score each response 0-5 using a detailed rubric
  - Build preference pairs: (highest score, lowest score) for each prompt
- Train M₂ from M₁ using DPO on these AI-Feedback pairs
- Repeat to get M₃ (using M₂’s better-calibrated scores for the next round of training data)
“The key to such an approach is to develop an agent that possesses all the abilities desired during training, rather than separating them out into distinct models such as a reward model and a language model.”
STANDARD RLHF:
Human preferences → [Reward Model] → FROZEN
↓
Prompts → [LLM Policy] → Responses → R.M. scores → PPO update → [LLM Policy v2]
(reward model never improves)
SELF-REWARDING:
Prompts → [LLM M_t] → 4 responses
↓
[Same LLM M_t as judge] → scores each response 0-5
↓
Build preference pairs: (best, worst)
↓
DPO → M_{t+1} (better at instructions AND better at judging)
↓
Repeat — the judge improves alongside the student
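The loop in the diagram can be sketched in a few lines. This is a minimal illustration, not the paper's code: `self_rewarding_round` and the three `toy_*` functions are hypothetical names, and the toy stand-ins replace real LLM sampling, judging, and DPO training so the structure can run on its own. The point to notice is that the same `model` object is handed to both the generator and the judge.

```python
from typing import List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen response, rejected response)


def self_rewarding_round(model, prompts, generate, judge, dpo_train, n=4):
    """One M_t -> M_{t+1} round. The same `model` is passed to both
    `generate` and `judge`; there is no separate reward model."""
    pairs: List[Pair] = []
    for prompt in prompts:
        candidates = generate(model, prompt, n)
        scores = [judge(model, prompt, c) for c in candidates]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        worst = min(range(len(candidates)), key=lambda i: scores[i])
        if scores[best] == scores[worst]:
            continue  # all candidates tied: no preference signal
        pairs.append((prompt, candidates[best], candidates[worst]))
    return dpo_train(model, pairs), pairs


# Toy stand-ins (assumptions) so the sketch runs; a real setup calls an LLM.
def toy_generate(model, prompt, n):
    return [f"{prompt} (draft {i})" for i in range(n)]


def toy_judge(model, prompt, response):
    return int(response[-2])  # pretend later drafts score higher (0..3)


def toy_dpo_train(model, pairs):
    return model + 1  # the "model" is just a version counter here


model = 1  # M1
for _ in range(2):  # produces M2, then M3
    model, pairs = self_rewarding_round(
        model, ["What causes a rainbow?"], toy_generate, toy_judge, toy_dpo_train
    )
```

Passing the generator, judge, and trainer in as callables keeps the control flow separate from any particular model library.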
The LLM-as-a-Judge rubric:
The scoring prompt uses an additive 5-point rubric. Each criterion the response satisfies adds one point, so the running total is:
- 1 point if the response is relevant at all
- 2 points if it also addresses a substantial portion of the question
- 3 points if it also answers the basic elements usefully
- 4 points if it is also clearly written from an AI assistant’s perspective and comprehensive
- 5 points only if it is also impeccably tailored and expert-level, with no extraneous content
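A small sketch of how such an additive score might be computed and read back from the judge's text. It assumes each satisfied criterion contributes exactly one point to the total, and that the judge ends its answer with a line like `Score: 4`. The criterion names, the `additive_score` and `parse_score` helpers, and the output format are all illustrative assumptions.

```python
import re
from typing import Dict, Optional

# Hypothetical short names for the five rubric criteria, in order.
CRITERIA = ("relevant", "substantial", "basic_answer", "assistant_style", "expert")


def additive_score(met: Dict[str, bool]) -> int:
    """Each satisfied criterion adds one point, giving a total in 0..5."""
    return sum(1 for name in CRITERIA if met.get(name, False))


def parse_score(judge_output: str) -> Optional[int]:
    """Extract the total from text such as '... Score: 4' (assumed format)."""
    match = re.search(r"Score:\s*([0-5])\b", judge_output)
    return int(match.group(1)) if match else None


total = additive_score({"relevant": True, "substantial": True, "basic_answer": True})
parsed = parse_score("Relevant and mostly complete, but too terse. Score: 4")
missing = parse_score("The judge forgot to give a verdict.")
```

Returning `None` for unparseable output lets the caller drop that judgment rather than silently treating it as a zero.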
This rubric format turned out to matter. An alternative multiple-choice rubric (buckets of quality) gave a pairwise accuracy, meaning agreement with human rankings on pairs of responses, of only 26.6% vs. 65.1% for the additive format.
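For reference, pairwise accuracy can be computed roughly as follows. This sketch assumes the metric counts, over all response pairs that humans ranked differently, how often the model's scores order the pair the same way. Treating a model tie as a miss is an assumption here, and `pairwise_accuracy` is a hypothetical helper.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(model_scores: Sequence[float], human_ranks: Sequence[int]) -> float:
    """model_scores: higher is better. human_ranks: 1 is best.
    Pairs the humans tied on are skipped; model ties count as misses."""
    agree = total = 0
    for i, j in combinations(range(len(model_scores)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # no human preference to compare against
        total += 1
        human_prefers_i = human_ranks[i] < human_ranks[j]
        if human_prefers_i and model_scores[i] > model_scores[j]:
            agree += 1
        elif not human_prefers_i and model_scores[j] > model_scores[i]:
            agree += 1
    return agree / total if total else float("nan")


# Four responses to one prompt: the model's 0-5 scores vs. a human ranking.
acc = pairwise_accuracy([3, 5, 1, 3], [2, 1, 4, 3])
```

Here the model ties the two responses that humans ranked second and third, so it gets five of the six pairs right.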
Walkthrough with real numbers:
Say M₁ is asked “What causes a rainbow?”
It samples N=4 candidate responses and grades them:
Response 1: Basic explanation, correct but vague → Score: 3
Response 2: Detailed optics, refraction angles, diagrams → Score: 5
Response 3: Wrong (mentions reflection primarily) → Score: 1
Response 4: Correct but written casually, brief → Score: 3
Preference pair: (Response 2, Response 3)
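Unused: Responses 1 and 4 (both scored 3) are neither the highest nor the lowest, so they appear in no pair. A prompt is discarded only when its highest and lowest scores are the same.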
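The pair selection for this example can be written out directly. A minimal sketch: `select_pair` is a hypothetical helper, ties for best or worst are broken by taking the first occurrence (an assumption), and the `prompt`/`chosen`/`rejected` keys follow a common convention for preference data rather than anything specified in the paper.

```python
from typing import Optional, Sequence, Tuple


def select_pair(scores: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Return (chosen_index, rejected_index), or None when the highest
    and lowest scores are equal and the prompt yields no pair."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])
    if scores[best] == scores[worst]:
        return None
    return best, worst


responses = ["Response 1", "Response 2", "Response 3", "Response 4"]
scores = [3, 5, 1, 3]

chosen, rejected = select_pair(scores)
record = {
    "prompt": "What causes a rainbow?",
    "chosen": responses[chosen],
    "rejected": responses[rejected],
}
all_tied = select_pair([3, 3, 3, 3])
```

With these scores the record pairs Response 2 against Response 3, and the two middle responses never enter the training set.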
DPO trains M₂ to:
- Increase P(Response-2-style | "What causes a rainbow?")
- Decrease P(Response-3-style | "What causes a rainbow?")
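The "increase one, decrease the other" behavior comes from the DPO objective. Below is a sketch of the standard per-pair DPO loss in plain Python, taking sequence log-probabilities as inputs. The value `beta=0.1` and the example log-probabilities are made-up illustrations, and using the previous iteration's model as the reference is an assumption about the setup.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where margin is how
    much more the policy favors chosen over rejected than the reference does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# At the start of training the policy equals the reference: margin is zero.
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After an update that raises the chosen log-prob and lowers the rejected one.
updated = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss starts at log 2 and falls as the policy shifts probability toward the chosen response and away from the rejected one, relative to the reference.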
After this iteration, M₂ tends to produce more detailed, accurate optics explanations and, when asked to score responses about rainbows, gives better-calibrated scores. The likely explanation is that judging is itself an instruction-following task, so it improves as general instruction-following improves.
“the reward model itself can improve through these iterations, deviating from standard practices where the reward model is fixed.”
What’s clever — find the instinct:
Why didn’t anyone think of this earlier? Actually, they kind of did: Constitutional AI (Bai et al. 2022) used an LLM to provide feedback, but distilled that feedback into a separate, frozen reward model before using it for RL. The key assumption throughout the RLHF literature was: the reward model must be a different, separate artifact from the LLM being trained. Keeping them separate was supposed to prevent the LLM from gaming its own judge.
SRLM’s insight: that separation is what creates the ceiling. If you let the same model improve both tasks simultaneously — via DPO, which doesn’t require a separate reward model during training — you remove the bottleneck. The “gaming your own judge” risk is real but appears to be outweighed (at least over 3 iterations) by the benefit of a continuously improving signal.
Results
| Eval | M₁ | M₂ | M₃ | Comparison |
|---|---|---|---|---|
| AlpacaEval 2.0 win rate vs GPT-4 Turbo | 9.94% | 15.38% | 20.44% | Beats Claude 2 (17.19%), Gemini Pro (16.85%) |
| Reward model pairwise accuracy | 78.7% | 80.4% | 81.7% | vs. 65.1% for SFT-only baseline |
| MT-Bench (out of 10) | 6.78 | 7.01 | 7.25 | vs. 6.85 SFT baseline |
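All three models descend from the same base (LLaMA 2 70B), with each iteration initialized from the previous one. The only human-written instruction data is the ~3,200 seed examples from Open Assistant, plus the EFT seed set described above: no proprietary alignment data, no distillation from GPT-4 or Claude.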
What doesn’t work:
The wins are concentrated in general instruction-following. Math, coding, and logical reasoning show much smaller gains — the Open Assistant seed prompts underemphasize these tasks, so the self-generated training data inherits that bias.
Response length grows dramatically across iterations: M₁ averages 1,092 tokens, M₂ 1,552 tokens, M₃ 2,552 tokens on AlpacaEval. Since LLM judges such as GPT-4 tend to prefer longer, more detailed responses, it’s unclear how much of the AlpacaEval win-rate improvement reflects genuine quality rather than verbosity.
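To put the two trends side by side, here is the growth of each figure relative to M₁, using only the numbers reported above. This is plain arithmetic and says nothing about causation.

```python
# Figures quoted above: average response length and AlpacaEval 2.0 win rate (%).
lengths = {"M1": 1092, "M2": 1552, "M3": 2552}
win_rates = {"M1": 9.94, "M2": 15.38, "M3": 20.44}

# Growth factor of each quantity relative to the first iteration.
length_growth = {name: value / lengths["M1"] for name, value in lengths.items()}
win_growth = {name: value / win_rates["M1"] for name, value in win_rates.items()}

for name in lengths:
    print(f"{name}: length x{length_growth[name]:.2f}, win rate x{win_growth[name]:.2f}")
```

By M₃ the average length has grown about 2.34x while the win rate has grown about 2.06x, which is why the length confound is hard to rule out from these numbers alone.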
The paper runs only three iterations. Beyond that, gains would likely saturate: if the model’s judgment can’t improve past a ceiling set by its own knowledge, neither can the quality of the preference data it generates for itself.
So what?
If you’re building a fine-tuned model and don’t have access to large-scale human preference data, self-rewarding training is a practical path to iterative improvement. Start with a small set of high-quality seed IFT examples (LIMA showed ~1K is enough), add EFT examples teaching the model to evaluate responses, and run 2-3 DPO iterations. Each iteration generates ~4K-7K preference pairs from the model itself — cheap compared to human annotation.
The catch: this works best when your seed data’s distribution matches your target tasks. If your use case involves math or code, you need seed prompts from those domains — the model can only improve on what it can generate and evaluate.
This paper connects directly to constitutional-ai-harmlessness-from-ai-feedback — both use AI to generate feedback instead of humans. The difference is that Constitutional AI still distills that feedback into a frozen reward model, preserving the bottleneck. SRLM removes it. And it builds on direct-preference-optimization-your-language-model-is-secretly-a-reward-model — without DPO as the off-policy preference training step, running iterative training this cheaply wouldn’t be feasible. PPO would require the judge to be available at RL training time, which is computationally expensive; DPO builds the preference dataset offline and trains on it, which is cheap.
When the student teaches themselves, the ceiling is wherever their judgment can reach — not wherever their original teacher stopped.
Connections
- rlhf — SRLM addresses the core bottleneck of RLHF: the frozen reward model
- reward-model — eliminates the separate reward model by merging it into the LLM
- dpo — DPO is the off-policy training algorithm that makes iterative self-rewarding cheap
- ai-feedback — extends RLAIF by making the feedback model non-frozen
- alignment — a new training paradigm for alignment that doesn’t require large human preference datasets
- sft — M₁ is initialized via SFT on seed data before self-rewarding begins
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — the DPO backbone
- training-language-models-to-follow-instructions-with-human-feedback — InstructGPT RLHF, the baseline being improved upon
- constitutional-ai-harmlessness-from-ai-feedback — related RLAIF approach but keeps reward model frozen
Citation
Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., & Weston, J. (2024). Self-Rewarding Language Models. arXiv preprint. https://arxiv.org/abs/2401.10020