Concepts: pre-training | rlhf | sft | alignment | gqa Builds on: proximal-policy-optimization | training-language-models-to-follow-instructions-with-human-feedback Leads to: qlora-efficient-finetuning-quantized-llms
The year is 2023 and every open-source chat model is a thin wrapper over an instruction-tuned base: take a pretrained LLM, add a few thousand prompt-response examples, maybe a round of RLHF from InstructGPT’s recipe, and ship. The results are capable but brittle — models that forget your instructions after a few turns, conflate helpfulness with harmlessness, and have no documented safety process whatsoever.
Meta’s Llama 2 paper is unusual in ML publishing. It’s less a single technique than a full engineering playbook: here is every decision we made, every ablation we ran, every annotation protocol we used, every failure mode we hit. The 77-page paper is the most detailed public account of how a competitive chat LLM is actually built.
The core idea
The analogy. Think of building a customer service rep from scratch. First, you hire someone smart with broad knowledge (pretraining). Then you show them transcripts of excellent customer interactions so they learn the communication style (SFT). Next, you have supervisors compare pairs of their responses — “this reply was better than that one” — and use those ratings to build a scoring rubric (reward model). Finally, you run a continuous improvement cycle: the rep drafts responses, the rubric scores them, and the rep iterates toward higher scores (PPO).
Here’s the catch no earlier work solved cleanly: helpfulness and safety pull in opposite directions. A rep who always helps — even with “how do I make a bomb?” — scores perfectly on helpfulness but catastrophically on safety. Meta’s solution: train two separate rubrics, one for each dimension, and combine them with a priority rule.
The mechanism, step by step:
- Pretrain on 2 trillion tokens of public web data. Use RMSNorm, SwiGLU activations, RoPE positional embeddings, and a 4,096-token context window (double LLaMA 1’s 2,048). For 34B and 70B models, add grouped-query attention (GQA) to make inference cheaper.
- SFT on 27,540 high-quality examples. Quality beats quantity. Meta collected expert-written prompt-response pairs and threw out millions of third-party examples. Two epochs, lr $= 2 \times 10^{-5}$ with cosine decay, loss computed only on response tokens (not the prompt).
- Build two reward models. A Helpfulness RM and a Safety RM, both initialized from the SFT checkpoint. Train each with a margin-augmented binary ranking loss:

$$\mathcal{L}_{\text{ranking}} = -\log\!\left(\sigma\!\left(r_\theta(x, y_c) - r_\theta(x, y_r) - m(r)\right)\right)$$

where $y_c$ is the preferred response, $y_r$ is the rejected one, and $m(r)$ is a margin that widens when annotators were more certain (e.g., “significantly better” → larger $m(r)$). Meta collected 1.4 million binary comparisons internally.
- Rejection sampling fine-tuning. Sample $K$ completions from the 70B model for each prompt, score them with the reward model, keep the best. Fine-tune on those best responses. Smaller models are distilled from the 70B’s best outputs.
- PPO with a combined reward function. The final reward is a piecewise combination of the two RM scores:

$$R_c(g \mid p) = \begin{cases} R_s(g \mid p) & \text{if } \texttt{IS\_SAFETY}(p) \text{ or } R_s(g \mid p) < 0.15 \\ R_h(g \mid p) & \text{otherwise} \end{cases}$$

Safety scores below 0.15 immediately override the helpfulness score. The policy optimizes:

$$\arg\max_\pi \; \mathbb{E}_{p \sim \mathcal{D},\, g \sim \pi}\!\left[\, R_c(g \mid p) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(g \mid p) \,\Vert\, \pi_0(g \mid p)\right) \right]$$

The KL penalty ($\beta = 0.01$ for 7B/13B, $\beta = 0.005$ for 34B/70B) stops the model from drifting too far from the initial policy $\pi_0$ — preventing reward hacking.
- Ghost Attention (GAtt) for multi-turn consistency. Models trained with RLHF kept forgetting system-level instructions after a few turns. GAtt’s fix: during training, synthetically prepend the system instruction to every user message in the conversation, sample responses from the current RLHF model, then train on those — but zero out the loss on all previous turns. The model learns to behave as if it sees the instruction at every turn, even when it’s only present at the start.
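The two-reward gate and the KL-penalized PPO reward can be sketched in a few lines of Python. This is a simplified illustration: the paper additionally whitens and logit-transforms the combined score before the KL term, which is omitted here, and the function names are mine, not the paper’s.

```python
SAFETY_THRESHOLD = 0.15  # below this, the safety score overrides helpfulness

def combined_reward(r_help, r_safety, is_safety_prompt=False):
    """Piecewise combination of the two reward models:
    safety acts as a hard gate on helpfulness."""
    if is_safety_prompt or r_safety < SAFETY_THRESHOLD:
        return r_safety
    return r_help

def ppo_reward(r_combined, kl_to_initial_policy, beta=0.01):
    """Per-sample PPO reward: gated RM score minus a KL penalty that
    keeps the policy near the initial policy (beta = 0.01 for 7B/13B,
    0.005 for 34B/70B)."""
    return r_combined - beta * kl_to_initial_policy

# Borderline-unsafe generation: the low safety score gates the reward.
assert combined_reward(r_help=0.9, r_safety=0.10) == 0.10
# Safe and helpful: helpfulness passes through, minus the KL penalty.
assert abs(ppo_reward(combined_reward(0.9, 0.8), kl_to_initial_policy=2.0) - 0.88) < 1e-9
```

The gate, rather than a weighted sum, is what makes safety a constraint instead of a trade-off term.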
LLAMA 2 TRAINING PIPELINE
2T tokens of web data
│
┌─────▼──────┐
│ Pretrain │ Transformer + RMSNorm + SwiGLU + RoPE
│ 7B/13B/ │ 4K context, GQA for 34B/70B
│ 34B/70B │
└─────┬──────┘
│
┌─────▼──────┐
│ SFT │ 27,540 expert examples, quality > quantity
│ │ Loss only on response tokens, 2 epochs
└─────┬──────┘
│
┌─────▼──────────────────────────────────┐
│ RLHF (iterative) │
│ │
│ ┌───────────┐ ┌───────────────────┐ │
│ │Helpfulness│ │ Safety RM │ │
│ │ RM │ │ (safety-first, │ │
│ │ │ │ 0.15 threshold) │ │
│ └─────┬─────┘ └────────┬──────────┘ │
│ └────────┬─────────┘ │
│ ┌────▼─────┐ │
│ │ Combined │ │
│ │ Reward │ │
│ └────┬─────┘ │
│ │ │
│ Rejection Sampling (70B → distill) │
│ + PPO (KL-penalized) │
│ + GAtt (multi-turn consistency) │
└─────────────────┬──────────────────────┘
│
Llama 2-Chat
(7B / 13B / 34B / 70B)
What’s clever
Why two reward models? Earlier RLHF papers tried to handle helpfulness and safety in a single model. But safety and helpfulness anti-correlate on adversarial prompts: a helpful answer to “how do I pick a lock?” is an unsafe one. A single reward model trained on a mix of these annotations learns an average that satisfies neither objective well.
The insight is to treat this as a constrained optimization problem: maximize helpfulness subject to the constraint that safety exceeds a threshold. The piecewise reward function is exactly that — a gate plus an objective.
“We train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM)… we define $R_c$ to be a piecewise combination of the safety and helpfulness reward models.”
Ghost Attention was born from a concrete failure: “our initial RLHF models tended to forget the initial instruction after a few turns of dialogue.” The fix is a data-formatting trick — prepend the system instruction everywhere in training, zero out losses on prior turns, and the model internalizes it:
“GAtt enables dialogue control over multiple turns… we synthetically concatenate this instruction to all the user messages of the conversation.”
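GAtt is pure data formatting. A minimal sketch of the two pieces, assuming a simple (user, assistant) turn representation and a 0/1 per-token loss mask — these conventions are illustrative, not from the paper:

```python
def gatt_format(system_instruction, turns):
    """Prepend the system instruction to every user message, so the
    model used for sampling behaves as if it sees it at every turn."""
    return [(f"{system_instruction}\n{user}", assistant)
            for user, assistant in turns]

def gatt_loss_mask(turn_token_lengths):
    """Build a per-token mask that zeroes out the loss on all turns
    except the final one."""
    mask = []
    last = len(turn_token_lengths) - 1
    for i, n in enumerate(turn_token_lengths):
        mask.extend([1 if i == last else 0] * n)
    return mask

turns = [("Hi", "Hello!"), ("Tell me a joke", "Why did the chicken...")]
formatted = gatt_format("Always answer as a pirate.", turns)
assert formatted[1][0] == "Always answer as a pirate.\nTell me a joke"
assert gatt_loss_mask([4, 3, 5]) == [0] * 7 + [1] * 5
```

At inference the instruction appears only once, at the start, but the model has internalized applying it at every turn.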
Numeric walkthrough: reward model loss.
An annotator sees two responses to “Write a poem about Paris”:
- Response A ($y_c$, preferred): creative poem — marked “significantly better”
- Response B ($y_r$, rejected): generic description

Margin for “significantly better”: $m = 1$ (the paper’s “margin small” scale). Say the reward model outputs $r_\theta(x, y_c) = 0.9$ and $r_\theta(x, y_r) = 0.2$: the loss is $-\log \sigma(0.9 - 0.2 - 1) = -\log \sigma(-0.3) \approx 0.85$.

Now a nearly-equal pair (margin $m = 0$) with $r_\theta(x, y_c) = 0.6$ and $r_\theta(x, y_r) = 0.5$: the loss is $-\log \sigma(0.1) \approx 0.64$.

The clear pair incurs the larger loss even though its score gap (0.7) is wider, because the gap falls short of the margin. This is intentional: where annotators were more certain, the margin forces the reward model to separate the responses by a wider score gap; near-equal pairs carry weak signal, so a small gap suffices.
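The margin effect can be checked numerically with the margin-augmented ranking loss and some illustrative reward scores (the specific values here are for demonstration only):

```python
import math

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Margin-augmented binary ranking loss:
    -log(sigmoid(r_chosen - r_rejected - margin))."""
    z = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# Clear pair ("significantly better", margin 1): a 0.7 score gap
# still falls short of the margin and incurs substantial loss.
loss_clear = ranking_loss(0.9, 0.2, margin=1.0)
# Near-equal pair (margin 0): a 0.1 gap is already acceptable.
loss_close = ranking_loss(0.6, 0.5, margin=0.0)

assert abs(loss_clear - 0.854) < 1e-3
assert abs(loss_close - 0.644) < 1e-3
assert loss_clear > loss_close
```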
Results
Llama 2 pretrained base vs open-source (70B scale). Math and Code are the paper’s grouped averages (GSM8K + MATH, and HumanEval + MBPP, respectively):
| Model | MMLU | Math | BBH | Code |
|---|---|---|---|---|
| Falcon 40B | 55.4 | 12.6 | 37.1 | 15.2 |
| LLaMA 1 65B | 63.4 | 30.8 | 43.5 | 30.7 |
| Llama 2 70B | 68.9 | 35.2 | 51.2 | 37.5 |
Llama 2 70B vs closed-source:
| Model | MMLU | GSM8K | HumanEval |
|---|---|---|---|
| PaLM 540B | 69.3 | 56.5 | 26.2 |
| GPT-3.5 | 70.0 | 57.1 | 48.1 |
| Llama 2 70B | 68.9 | 56.8 | 29.9 |
| GPT-4 | 86.4 | 92.0 | 67.0 |
Llama 2 70B essentially matches GPT-3.5 on MMLU and GSM8K. The code gap is large (HumanEval: 29.9 vs 48.1), and GPT-4 is a different class entirely.
What doesn’t work:
Coding is the glaring weakness — the pretraining mix likely contained less code than OpenAI’s. On HumanEval, GPT-3.5’s 48.1 is roughly 60% higher than Llama 2 70B’s 29.9.
Rejection sampling before PPO introduces silent forgetting. RLHF V3 improved overall but “struggled more than previous versions to compose rhyming lines in poems.” Capabilities from earlier checkpoints erode when you only sample from the most recent model. The fix — keeping top samples from all prior iterations — works but adds bookkeeping complexity.
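The rejection-sampling step, together with the cross-iteration fix, reduces to “best-of-K with a persistent sample pool.” A minimal sketch, where `generate` and `score` are stand-ins for the 70B policy and the reward model, not the paper’s code:

```python
def rejection_sample(prompts, generate, score, k=8, pool=None):
    """Best-of-k sampling with a cross-iteration sample pool, so the
    top response ever seen for a prompt survives later RLHF rounds."""
    pool = {} if pool is None else pool
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        candidates.append(pool.get(prompt))  # carry over the previous best
        pool[prompt] = max((c for c in candidates if c is not None),
                           key=lambda c: score(prompt, c))
    return pool

# Toy stand-ins: the "policy" cycles through canned answers,
# and the "reward model" simply prefers longer responses.
answers = iter(["ok", "a longer, better answer", "meh"])
pool = rejection_sample(["Explain RLHF"], generate=lambda p: next(answers),
                        score=lambda p, c: len(c), k=3)
assert pool["Explain RLHF"] == "a longer, better answer"
```

Pooling across iterations is the bookkeeping the paper adds after observing the rhyming regression: without it, each round only ever sees the most recent model’s samples.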
The paper is also candid about hallucination: models confidently state false information, especially on obscure topics, despite the factual up-sampling during pretraining.
Practitioner notes
The highest-leverage insight for SFT is quality beats quantity. Meta collected 27,540 examples and discarded millions of lower-quality ones. The LIMA paper independently found the same with 1,000 examples. A hundred expert-written examples beat a thousand crowd-sourced ones for teaching communication style.
The two-reward-model setup is worth adopting when your application has a hard safety gate — medical, legal, finance. Use a domain-specific safety scorer as a threshold filter, and a quality scorer as the objective. Don’t try to train a single model for both: the signals conflict on adversarial inputs.
Ghost Attention is a free win for multi-turn consistency. Concatenating the system prompt to every user turn in your training data, then zeroing losses on prior turns, costs nothing at inference time. No architectural change. Just data formatting.
This paper sits at the transition point where “open-source LLM” stopped meaning “weaker but available” and started meaning “competitive.” The next papers to read — qlora-efficient-finetuning-quantized-llms for fitting a 70B into a single GPU, and proximal-policy-optimization for the PPO algorithm underlying the RLHF stage — complete the picture.
A fully documented, iterative RLHF pipeline with two reward models and rejection sampling can produce a 70B model that matches GPT-3.5 on MMLU and GSM8K — and publishing every detail is a strategic moat, not a giveaway.
Connections
- pre-training — Llama 2 base trained on 2T tokens with autoregressive next-token prediction
- sft — 27,540 high-quality expert examples; quality over quantity is the key finding
- rlhf — full iterative RLHF pipeline: rejection sampling + PPO + iterative reward model updates
- alignment — two-reward-model setup (helpfulness + safety) with 0.15 priority threshold
- gqa — grouped-query attention for 34B/70B models reduces KV cache memory at inference
- reward-model — two separate RMs trained on 1.4M binary comparisons with margin loss
- ppo — KL-penalized PPO, $\beta = 0.01$ (small) and $\beta = 0.005$ (large models)
- proximal-policy-optimization — PPO algorithm used in the RLHF stage
- training-language-models-to-follow-instructions-with-human-feedback — InstructGPT: the RLHF recipe Llama 2 extended
- lima-less-is-more-for-alignment — parallel finding: SFT quality beats quantity
Citation
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint. https://arxiv.org/abs/2307.09288