ORPO: Monolithic Preference Optimization without Reference Model

Concepts: alignment | dpo | sft | rlhf | reward-model
Builds on: direct-preference-optimization-your-language-model-is-secretly-a-reward-model | training-language-models-to-follow-instructions-with-human-feedback | proximal-policy-optimization
Leads to: post-RLHF alignment research on monolithic single-phase fine-tuning

The problem

Most alignment recipes in use today have at least two stages. First you fine-tune on demonstrations (SFT). Then you run a preference phase, either PPO with a reward model or DPO with a frozen reference model, to teach the model which responses are better. That second phase needs extra models in memory: DPO keeps a frozen reference copy next to the policy, and PPO adds a reward model (and usually a value model) on top. If you’re aligning a 7B model, that means holding 14B parameters for DPO and 21B or more for PPO.

Why is the second phase necessary at all? Because SFT alone is broken in a specific way. When you fine-tune only on the chosen (good) responses from a preference dataset, you expect the model to learn to generate good responses. But the model doesn’t know the rejected responses exist — it’s just doing next-token prediction on the chosen text. And here’s the catch: fine-tuning on domain-specific text raises the probability of all responses in that domain, chosen or not.

The paper demonstrates this empirically: “Both the log probability of chosen and rejected responses exhibited a simultaneous increase” when training OPT-350M on chosen-only HH-RLHF data. The model learns to be more “dialogue-like” — which is correct — but it applies that to both good and bad dialogue. You need a second phase to fix the second half.
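A toy sketch of why this can happen (not the paper’s OPT-350M experiment): assume a unigram softmax model over a six-token vocabulary and two made-up replies, `chosen` and `rejected`, that share most of their tokens. Training on the chosen reply alone raises the log-probability of both replies during the first few steps, because the shared tokens become more likely. Every name and number below is an illustrative assumption, and the sketch only says something about those early steps.

```python
import math

VOCAB_SIZE = 6
chosen = [0, 1, 2, 1]     # toy "good" reply, as token ids
rejected = [0, 1, 3, 1]   # toy "bad" reply that shares most tokens with the good one

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def seq_logp(logits, tokens):
    """Log-probability of a token sequence under a unigram model."""
    lp = log_softmax(logits)
    return sum(lp[t] for t in tokens)

def sft_step(logits, tokens, lr=0.1):
    """One gradient-ascent step on the log-likelihood of `tokens`."""
    probs = [math.exp(v) for v in log_softmax(logits)]
    counts = [tokens.count(i) for i in range(len(logits))]
    # d/dz_i of sum_t log softmax(z)[token_t] is count_i - n * p_i
    return [z + lr * (c - len(tokens) * p) for z, c, p in zip(logits, counts, probs)]

logits = [0.0] * VOCAB_SIZE
before = (seq_logp(logits, chosen), seq_logp(logits, rejected))
for _ in range(5):
    logits = sft_step(logits, chosen)   # the rejected reply is never seen
after = (seq_logp(logits, chosen), seq_logp(logits, rejected))

print(f"chosen   log-prob: {before[0]:.3f} -> {after[0]:.3f}")
print(f"rejected log-prob: {before[1]:.3f} -> {after[1]:.3f}")
```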

ORPO’s bet: what if you didn’t need that second phase? What if you could add a gentle “this, not that” signal directly into SFT?

The core idea

The analogy. Think about how a chef trains an apprentice. The old way: spend a month teaching classical technique (SFT), then spend another month correcting bad habits one by one (preference alignment). The problem: by the time you start correcting, some bad habits are already baked in. The better way: every time you teach a technique, you simultaneously demonstrate the right approach and point to the wrong one. “This is how you brunoise. Not that — see how the chunks are uneven?” One lesson, two signals.

ORPO does this for language models. It folds the preference signal directly into fine-tuning: for every training example with a chosen response $y_{w}$ and a rejected response $y_{l}$, it simultaneously (1) increases the model’s probability of generating $y_{w}$ and (2) applies a mild penalty on $y_{l}$. No second phase. No reference model.

The mechanism, step by step.

For a prompt $x$, you have a preferred response $y_{w}$ (chosen) and a dispreferred response $y_{l}$ (rejected).
Compute the SFT loss normally: the negative log-likelihood of $y_{w}$, token by token.
Compute the odds of generating each response: $odds_{θ} (y ∣ x) = \frac{P_{θ} (y ∣ x)}{1 - P_{θ} (y ∣ x)}$. The odds tell you how much more likely the model is to generate $y$ than to not generate it.
Compute the odds ratio of $y_{w}$ over $y_{l}$: how much more likely is the good response relative to the bad one?
Wrap the log odds ratio in a negative log-sigmoid to form a loss that goes to zero when the good response is decisively preferred.
Add this penalty to the SFT loss. Done.
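The steps above can be sketched for a single training triple as follows. This is a minimal illustration that takes the two response probabilities as plain floats; the function name `orpo_loss_from_probs` and its return layout are assumptions made for the example, not something from the paper.

```python
import math

def orpo_loss_from_probs(p_chosen, p_rejected, lam=0.1):
    """ORPO loss for one (x, y_w, y_l) triple, given P(y_w|x) and P(y_l|x)."""
    # SFT term: negative log-likelihood of the chosen response.
    l_sft = -math.log(p_chosen)
    # Odds of each response: p / (1 - p).
    odds_w = p_chosen / (1.0 - p_chosen)
    odds_l = p_rejected / (1.0 - p_rejected)
    # Log odds ratio of chosen over rejected.
    log_or = math.log(odds_w / odds_l)
    # Negative log-sigmoid: -log(sigmoid(z)) == log(1 + exp(-z)).
    l_or = math.log1p(math.exp(-log_or))
    # Weighted sum of the two terms.
    return l_sft + lam * l_or, l_sft, l_or
```

Returning the two terms separately makes it easy to log how much of the total comes from the penalty.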

STANDARD ALIGNMENT PIPELINE:

  Pre-trained LM ──[SFT]──► SFT model ──[DPO/PPO + reference model]──► Aligned model
                                               ↑
                                    2nd copy of model frozen here

ORPO:

  Pre-trained LM ──[SFT loss + OR penalty]──► Aligned model
                              │
                    No reference model.
                    One phase. One model in memory.

What the OR penalty does at training time:

  Chosen "The capital is Paris."     → log prob ↑  (SFT)
  Rejected "Maybe Lyon or something" → odds ratio ↓ (OR penalty, mild)
                                              ↑
                                  NOT suppressed to zero.
                                  Just a gentle contrast.

The math.

The ORPO loss is:

$L_{ORPO} = E_{(x, y_{w}, y_{l})} [L_{SFT} + λ \cdot L_{OR}]$

where $L_{SFT}$ is the standard negative log-likelihood on $y_{w}$, and:

$L_{OR} = -\log σ \left(\log \frac{odds_{θ} (y_{w} ∣ x)}{odds_{θ} (y_{l} ∣ x)}\right)$

$odds_{θ} (y ∣ x) = P_{θ} (y ∣ x) / (1 - P_{θ} (y ∣ x))$: the odds the model generates $y$
$OR_{θ} (y_{w}, y_{l}) = odds_{θ} (y_{w} ∣ x) / odds_{θ} (y_{l} ∣ x)$: the odds ratio, i.e. how much more likely the good response is than the bad one
$\log σ (\cdot)$: wraps the log odds ratio so that minimizing the loss drives toward a large OR (good beats bad decisively)
$λ$: a small weight (0.1–0.25 in experiments) that keeps the penalty mild

Crucially: no reference model $π_{ref}$ anywhere in this equation. No KL divergence against a frozen copy. The model compares chosen to rejected within the current batch, in real time.

Walkthrough with actual numbers.

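Say we’re mid-training on a factual question. Here $P_{θ} (y ∣ x)$ is the length-normalized likelihood of a response (the geometric mean of its token probabilities, as the paper defines it), so the two values below need not sum to 1. At this point: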

$P_{θ} (y_{w} ∣ x) = 0.70$ — model gives 70% probability to “The capital of France is Paris”
$P_{θ} (y_{l} ∣ x) = 0.40$ — model gives 40% probability to “I’m not sure, maybe Lyon”

Compute odds:

$odds (y_{w}) = 0.70/0.30 = 2.333$
$odds (y_{l}) = 0.40/0.60 = 0.667$

Log odds ratio: $\log (2.333/0.667) = \log (3.5) = 1.253$

$L_{OR} = -\log (σ (1.253)) = -\log (0.778) = 0.251$

$L_{SFT} = -\log (0.70) = 0.357$ (NLL of the chosen response)

With $λ = 0.1$: $L_{ORPO} = 0.357 + 0.025 = 0.382$

Now consider early training when the model hasn’t yet adapted to the domain: $P (y_{w}) = 0.55$, $P (y_{l}) = 0.50$.

$odds (y_{w}) = 0.55/0.45 = 1.222$
$odds (y_{l}) = 0.50/0.50 = 1.000$
Log OR = $\log (1.222) = 0.201$ → $L_{OR} = -\log (σ (0.201)) = 0.598$ ← large gradient

Same situation with DPO’s probability ratio: $\log (0.55/0.50) = 0.095$, roughly half as large. The odds ratio is more sensitive when probabilities are close, which is exactly the situation early in training before the model has learned domain-specific patterns. It gives more signal when you need it most.
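To check the comparison numerically, the sketch below computes both quantities for a few probability pairs. Note the simplification it inherits from the text: `log_prob_ratio` is the plain chosen-over-rejected ratio, not the full DPO objective, which also involves a reference model. The function names and the choice of pairs are illustrative.

```python
import math

def log_odds_ratio(p_w, p_l):
    """log of odds(y_w) / odds(y_l), with odds(p) = p / (1 - p)."""
    return math.log(p_w / (1.0 - p_w)) - math.log(p_l / (1.0 - p_l))

def log_prob_ratio(p_w, p_l):
    """log of P(y_w) / P(y_l), without any reference model."""
    return math.log(p_w / p_l)

pairs = [(0.55, 0.50), (0.70, 0.40), (0.90, 0.10)]
for p_w, p_l in pairs:
    lor = log_odds_ratio(p_w, p_l)
    lpr = log_prob_ratio(p_w, p_l)
    print(f"P(y_w)={p_w:.2f}  P(y_l)={p_l:.2f}  "
          f"log OR={lor:.3f}  log PR={lpr:.3f}  log OR / log PR={lor / lpr:.2f}")
```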

The paper makes this precise: “the odds ratio is a better choice when the preference alignment is done with SFT due to the mild discrimination of disfavored responses and the prioritizing of the favored responses to be generated.”

What’s clever.

ORPO noticed something everyone else had been treating as fixed: the SFT phase is a necessary prerequisite for DPO and PPO not because alignment algorithms require it theoretically, but because the algorithms were designed to be applied after adaptation. The reference model in DPO is just the SFT model frozen — it’s a stand-in for “the model before preference learning.” But if you’re doing SFT and preference learning simultaneously, the reference is implicit: it’s the model’s state at the previous gradient step.

The odds ratio is calibrated for this joint setting. Probability ratios (what DPO uses) have a sharp distribution — when probabilities are similar they produce very small log-ratios that don’t drive learning. Odds ratios have a wider, gentler distribution: they provide useful gradient even when probabilities are close, which is exactly the case during early SFT when the model isn’t yet domain-adapted.

The other insight: “a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT.” You don’t need to drive the rejected response probability to zero. You just need to break the symmetry — to give the model a consistent signal that one response direction is better than the other. A small $λ$ (0.1–0.25) is enough.
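One way to see how mild the penalty is: for $L_{OR} = -\log σ (z)$ with $z$ the log odds ratio, the derivative with respect to $z$ has magnitude $1 - σ (z)$, so the weighted term can never pull harder than $λ$ and it fades as the chosen response pulls ahead. The sketch below evaluates that quantity for a few probability pairs. It covers only the derivative with respect to $z$, not the full gradient through the model, and `or_term_pull` is a name made up for this example.

```python
import math

def log_odds_ratio(p_w, p_l):
    return math.log(p_w / (1.0 - p_w)) - math.log(p_l / (1.0 - p_l))

def or_term_pull(p_w, p_l, lam=0.1):
    """Magnitude of d(lam * L_OR)/dz, where L_OR = -log(sigmoid(z))."""
    z = log_odds_ratio(p_w, p_l)
    # lam * (1 - sigmoid(z)) simplifies to lam / (1 + exp(z)); always below lam.
    return lam / (1.0 + math.exp(z))

for p_w, p_l in [(0.55, 0.50), (0.70, 0.40), (0.95, 0.05)]:
    print(f"P(y_w)={p_w:.2f}  P(y_l)={p_l:.2f}  pull={or_term_pull(p_w, p_l):.4f}")
```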

Results and limitations

Model	Method	AlpacaEval 2.0	MT-Bench
Mistral 7B	ORPO-β	12.20%	7.32
Mistral 7B	Zephyr-β (SFT+DPO)	10.99%	~7.3
Llama-2 7B	ORPO	9.44%	—
Llama-2 Chat 13B	RLHF	7.70%	—
Phi-2 2.7B	ORPO	6.35%	—
Phi-2 2.7B	SFT+DPO	0.78%	—

Mistral-ORPO-β (7B) beats Zephyr-β on AlpacaEval 2.0 (12.20% vs 10.99%) and matches on MT-Bench — using only UltraFeedback for one epoch, no SFT warm-up phase. Llama-2 ORPO (7B) beats Llama-2-Chat 13B on AlpacaEval 2.0 at half the parameters and without RLHF’s engineering overhead.

The paper also reports win rates for ORPO on HH-RLHF with OPT models from 125M to 1.3B: 70.9% against DPO at OPT-1.3B, and 79.4% against PPO.

What doesn’t work. ORPO’s per-input diversity is lower than DPO’s — the model becomes more decisive but less explorative. That’s a known tradeoff: sharp preference learning concentrates probability mass. For applications where response diversity matters (sampling-based ensembles, best-of-N selection), DPO’s smoother logit distribution may be preferable.

ORPO still needs paired preference data $(y_{w}, y_{l})$ for each prompt — it doesn’t solve the data collection problem that KTO addressed. And the paper’s experiments go up to 7B; it’s an open question whether the odds ratio penalty holds its advantage at 70B+, where SFT and domain adaptation dynamics are different.

“fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on the UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters.” That claim holds for AlpacaEval — it doesn’t extend to every benchmark, and MT-Bench results are comparable rather than clearly superior.

If you’re building ML systems

If you’re aligning a model and care about GPU memory and training simplicity, ORPO is the first thing to reach for. You skip the SFT warm-up, remove the reference model copy, and run one training loop instead of two. The implementation is a standard SFT loop with one additional loss term — about 10 lines of code added to a typical HuggingFace training script.
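As a rough idea of what that extra loss term looks like, here is a framework-free sketch that works from per-token log-probabilities. It assumes the response likelihood is length-normalized (the mean of the token log-probs), which is my reading of the paper’s definition, and it uses plain Python lists instead of tensors. The function names are hypothetical and do not come from any library.

```python
import math

def avg_logp(token_logps):
    """Length-normalized log-likelihood: mean of per-token log-probs (assumed)."""
    return sum(token_logps) / len(token_logps)

def log_odds(logp):
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def orpo_example_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    logp_w = avg_logp(chosen_token_logps)
    logp_l = avg_logp(rejected_token_logps)
    l_sft = -logp_w                                               # the usual SFT term
    l_or = neg_log_sigmoid(log_odds(logp_w) - log_odds(logp_l))   # the added term
    return l_sft + lam * l_or

def orpo_batch_loss(batch, lam=0.1):
    """Mean loss over (chosen_token_logps, rejected_token_logps) pairs."""
    return sum(orpo_example_loss(w, l, lam) for w, l in batch) / len(batch)
```

In a real training script the same arithmetic would run on framework tensors, with the per-token log-probs gathered from the model’s logits for the chosen and rejected sequences in one forward pass each.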

The practical limitation to watch: ORPO’s low per-input diversity. If your downstream task benefits from sampling diverse completions and picking the best (code generation, math, structured outputs), run an ablation comparing ORPO to DPO on your specific task before committing. For chat/instruction-following where you want consistent, on-format responses, ORPO’s sharper concentration is a feature, not a bug.

ORPO sits in the same lineage as direct-preference-optimization-your-language-model-is-secretly-a-reward-model and kto-model-alignment-prospect-theoretic-optimization — all three are trying to simplify the alignment pipeline by eliminating pieces that turned out to be unnecessary. DPO removed the reward model. KTO removed paired preferences. ORPO removes the reference model and the separate SFT phase. The pattern: every “necessary” component of RLHF has turned out to be necessary only given the constraints of the original design, not the problem itself.

One-liner: SFT already does 99% of alignment work — ORPO adds a single odds-ratio term to handle the other 1%, eliminating the reference model and the separate preference phase entirely.

Connections

alignment — ORPO contributes a monolithic SFT+alignment approach, eliminating the two-phase pipeline
dpo — ORPO replaces DPO’s probability ratio and reference model with an odds ratio directly in SFT
sft — ORPO shows SFT is the right base — preference alignment should be folded in, not added after
rlhf — ORPO eliminates both the reward model and the RL phase entirely
reward-model — Not needed in ORPO; the odds ratio contrast replaces reward model scoring

Citation

Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model. KAIST AI. arXiv:2403.07691.
