Concepts: sft | alignment | instruction-following | data-quality Builds on: llama-open-efficient-foundation-language-models | training-language-models-to-follow-instructions-with-human-feedback Leads to: llama-2-open-foundation-fine-tuned-chat-models | qlora-efficient-finetuning-quantized-llms
The standard recipe for a usable LLM in 2023 went like this: pretrain on a trillion tokens, then fine-tune on tens of thousands of instruction-response pairs, then run weeks of PPO against a reward model trained from human preference labels. Each stage was expensive, fragile, and guarded as secret sauce. The RLHF pipeline alone required maintaining four models simultaneously and a careful human labeling operation.
LIMA asks: what if almost all of that is unnecessary?
The core idea
The analogy: You want to learn academic writing. You could study 52,000 mediocre student essays — you’d absorb average conventions, common mistakes, the most-used hedging phrases. Or you could study 1,000 papers selected by an expert for being exceptionally clear, precise, and well-structured. The second approach teaches the form of good writing far more effectively — because that’s all you need. The subject knowledge came from reading widely (pretraining). What you need to learn is how to express that knowledge in the expected format.
That’s LIMA’s core claim: pretraining is where a model learns everything it knows. Fine-tuning is just teaching it how to talk.
The formal name for this claim is the Superficial Alignment Hypothesis:
“almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output”
If this is right, the exhausting RLHF pipeline is doing far less work than anyone assumed. The model’s capabilities were fixed at the end of pretraining. What RLHF adds is the style and tone of helpful responses — and style can be learned from a small, well-curated dataset.
The mechanism, step by step
Step 1: Start with a strong pretrained base.
LIMA uses LLaMA-65B. The model size matters: the hypothesis only holds if the base model already contains rich world knowledge from pretraining. A 7B model fine-tuned on the same 1,000 examples produces noticeably weaker results. The “pretraining does the work” argument requires pretraining to have actually done the work.
Step 2: Curate 1,000 examples with obsessive quality control.
The 1,000 training examples come from sources selected for response quality, diversity, and format variety — not volume:
DATA COMPOSITION: 1,000 total examples
===============================================
Stack Exchange       ~400 Top-voted answers (200 STEM, 200 other topics)
wikiHow              ~200 High-quality instructional step-by-step guides
Pushshift Reddit     ~150 Top posts from carefully selected subreddits
Natural Instructions  ~50 Examples sampled from the benchmark's tasks
Manually authored    ~200 Written by the paper's authors directly
EXPLICITLY EXCLUDED:
Alpaca-style GPT-3.5 self-instruct outputs (quantity without quality)
Unfiltered web scrapes (noisy, contradictory)
Single-domain focused datasets (kills generalization)
===============================================
Each example passed a quality bar: accurate, well-formatted, and covering a task type not already dominated in the dataset. The authors manually wrote 200 training examples themselves — ensuring a fifth of the data met their exact standard.
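A minimal sketch of such a quality-and-diversity filter, with a hypothetical `quality` score and `task_type` label standing in for the paper's manual human judgments:

```python
from collections import Counter

def curate(candidates, per_type_cap=50, min_quality=0.8):
    """Illustrative curation pass (hypothetical scoring, not the paper's
    actual tooling): keep only high-quality responses, and cap each task
    type so no single format dominates the final mix."""
    kept, per_type = [], Counter()
    for ex in sorted(candidates, key=lambda e: e["quality"], reverse=True):
        if ex["quality"] < min_quality:
            continue                       # fails the quality bar
        if per_type[ex["task_type"]] >= per_type_cap:
            continue                       # this format is already well covered
        per_type[ex["task_type"]] += 1
        kept.append(ex)
    return kept
```

The two checks mirror the paper's two selection axes: an absolute quality bar, and a diversity constraint that rejects otherwise-good examples whose format is already saturated.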
Step 3: Standard supervised fine-tuning, nothing fancy.
No RLHF. No PPO. No reward model. Standard cross-entropy loss over response tokens, with prompt tokens masked:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$$

where $x$ is the prompt, $y_t$ is the $t$-th response token, and $y_{<t}$ are all preceding response tokens. The prompt tokens are masked — they contribute zero gradient. The model learns only to produce the response given the context.
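The masking can be sketched in a few lines of plain Python; this is a toy that takes raw per-position vocabulary scores rather than a real model's tensors:

```python
import math

def masked_sft_loss(logits, token_ids, prompt_len):
    """Mean negative log-likelihood over response tokens only.
    logits[t] scores the prediction of token t+1; positions whose
    target falls inside the prompt are masked out of the loss."""
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if t + 1 < prompt_len:          # target is a prompt token: skip it
            continue
        scores = logits[t]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[token_ids[t + 1]]   # -log p(y_{t+1})
        count += 1
    return total / count
```

In a framework like PyTorch the same effect is usually achieved by setting prompt positions' labels to an ignore index so they contribute nothing to the gradient.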
Training configuration: 15 epochs, cosine learning rate decay, batch size 64, dropout 0.1 on residual connections, linear warmup over 32 steps. 1,000 examples × 15 epochs is 15,000 example passes, which at batch size 64 is only about 240 gradient steps — roughly 1-2 GPU-hours on an A100. Compared to RLHF’s weeks, this is rounding error.
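For scale, the step arithmetic and schedule can be checked directly; 15,000 example passes translate to roughly 240 optimizer steps at batch size 64. The peak learning rate of 1e-5 and decay to zero are illustrative assumptions, not values pinned down here:

```python
import math

EXAMPLES, EPOCHS, BATCH = 1_000, 15, 64
TOTAL_STEPS = math.ceil(EXAMPLES / BATCH) * EPOCHS   # 16 steps/epoch * 15

def lr_at(step, total_steps=TOTAL_STEPS, base_lr=1e-5, warmup=32):
    """Linear warmup, then cosine decay to zero over the remaining steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Note how short the run is: the 32-step warmup alone covers more than a tenth of all training.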
TRAINING COMPARISON
===========================================
Alpaca (LLaMA 65B fine-tune):
Data: 52,000 GPT-3.5 self-instruct outputs
Time: Multi-GPU days
Result: LIMA beats it in 79% of pairwise comparisons
LIMA:
Data: 1,000 curated examples (~10 MB of text)
Time: ~2 GPU-hours on A100
Result: Equal or preferred vs DaVinci003 (RLHF) in 65% of cases
InstructGPT / RLHF pipeline:
Data: 13K SFT demos + 33K pref comparisons + RL training
Time: Weeks, multiple model training stages
Result: Only GPT-4-tier output still beats LIMA (preferred in 57%)
===========================================
Find the instinct: why does quality beat quantity here?
The key question: why would 1,000 curated examples outperform 52,000 generated ones?
The answer is format diversity. Alpaca generates all 52,000 examples through the same GPT-3.5 prompt template — the responses cluster around a specific style, length distribution, and phrasing pattern. LIMA draws from five sources with genuinely different formats: Stack Exchange answers look like technical explanations, wikiHow looks like numbered guides, Reddit looks like conversational advice. This format diversity teaches the model that there is no single “correct response style” — the right format depends on what the question calls for. A model learns this generalization better from 1,000 diverse examples than from 52,000 homogeneous ones.
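One cheap way to see the homogeneity problem is a toy diagnostic over response lengths. This is an illustration of the idea, not anything the paper computes:

```python
import statistics

def length_spread(responses):
    """Toy homogeneity diagnostic: coefficient of variation of response
    lengths in words. Values near zero suggest one dominant template; a
    real check would also look at structure and phrasing, not just length."""
    lengths = [len(r.split()) for r in responses]
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

A corpus generated through one prompt template scores near zero on this kind of metric; a mix of terse answers, numbered guides, and long explanations scores much higher.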
The deeper implication: RLHF’s complexity may be solving a measurement problem (how do you operationalize “helpful”?) more than a capability problem. If you can show the model what helpful looks like — with genuine diversity and quality — it generalizes without needing the reward signal.
Worked example: what actually changes with curation
Let’s trace the impact of quality selection on a single training example.
Low-quality SFT example (Alpaca-style):
Prompt: "Explain photosynthesis"
Response: "Photosynthesis is the process by which plants use sunlight,
water and CO2 to produce oxygen and energy in the form of
sugar. It is very important for life on earth."
This is correct but thin. No structure, no mechanism. A model trained on 52,000 examples like this learns to produce brief, surface-level summaries.
High-quality SFT example (LIMA-style, from Stack Exchange):
Prompt: "Explain photosynthesis"
Response: "Photosynthesis happens in two stages:
1. Light-dependent reactions (in the thylakoid membrane):
Chlorophyll absorbs photons, splitting water (H₂O) to
release O₂ and store energy in ATP and NADPH.
2. Calvin cycle (in the stroma):
ATP and NADPH drive the fixation of CO₂ into G3P,
the precursor for glucose synthesis.
Net equation: 6CO₂ + 6H₂O + light → C₆H₁₂O₆ + 6O₂
The key bottleneck is usually RuBisCO, the enzyme that
fixes CO₂ — it's surprisingly slow and occasionally fixes
O₂ instead (photorespiration), which wastes energy."
Trained on 1,000 like this, the model learns: use structure, give mechanisms, include the equation, note the interesting edge case. It generalizes this style of thoroughness to other domains — even ones not in the training set.
Results
Human evaluation (pairwise preference study):
| Comparison | LIMA win | Tie | Baseline win | LIMA total (win+tie) |
|---|---|---|---|---|
| vs GPT-4 | 19% | 24% | 57% | 43% |
| vs Bard | 34% | 24% | 42% | 58% |
| vs DaVinci003 (RLHF) | 50% | 15% | 35% | 65% |
| vs Alpaca 65B | 79% | 11% | 10% | 90% |
| vs Claude | 30% | 21% | 49% | 51% |
The DaVinci003 row is the most striking: text-davinci-003 was trained with human feedback on tens of thousands of comparisons. LIMA’s responses are judged equal or better in 65% of pairwise comparisons, with 1,000 SFT examples and no reward model.
Ablation: does more data help?
The paper runs LIMA variants with 2,000 and 4,000 examples. Performance improves marginally. But mixing in lower-quality data — even when it increases dataset size — hurts performance. This is the strongest evidence for quality > quantity: adding noise actively degrades the model, even when the noisy examples are plausibly correct.
“the model tends to generalize well to unseen tasks that did not appear in the training data”
This is the result that validates the Superficial Alignment Hypothesis most directly: if fine-tuning were teaching task-specific knowledge, generalization to unseen task types would be poor. Instead, LIMA generalizes broadly — because the base model already knew, and fine-tuning taught it to communicate.
What breaks
Base model dependency. The hypothesis holds for LLaMA-65B specifically. The same 1,000-example fine-tune on a 7B model gives measurably weaker results. “Pretraining does the work” only holds if pretraining was comprehensive. Smaller models may not have internalized enough knowledge to generalize from sparse SFT signals.
Safety is unaddressed. LIMA does no safety alignment. The model can be prompted to produce harmful content at rates comparable to the base LLaMA-65B. The RLHF pipeline, whatever its capability contribution, does real safety work that 1,000 curated helpful examples do not cover.
Single-turn evaluation. The human study is predominantly single-turn. Multi-turn conversation — maintaining coherence, tracking context across exchanges, graceful clarification requests — is where RLHF-trained models show larger advantages. LIMA’s multi-turn quality degrades faster than GPT-4’s as conversations extend.
Production exposure gap. The evaluation uses a curated test set. Real deployment surfaces adversarial inputs, ambiguous requests, and off-distribution queries that no 1,000-example dataset captures by design.
Practitioner notes
If you’re building ML systems, LIMA changes the calculus on alignment data collection. Before running an expensive labeling pipeline, ask: is my base model strong enough that the “pretraining did the work” assumption holds? If the base model is 13B+ with quality pretraining data, a small but carefully curated instruction set (500-2,000 examples) will outperform a large but noisy one.
Concretely: if you’re fine-tuning for a specific domain (legal analysis, medical Q&A, code review), gather 500-1,000 examples that are genuinely excellent responses in that domain. Have domain experts select or write them. Don’t generate them with GPT-4 in a self-instruct loop — that produces GPT-4’s style and introduces homogeneity that kills diversity.
What you don’t get from LIMA alone: safety robustness, strong multi-turn performance, or hardened handling of adversarial prompts. The practical synthesis post-LIMA: run small, curated SFT first (LIMA-style), then apply DPO or a lightweight RLHF pass for safety and robustness — not a massive RLHF pipeline that assumes the base model needs fundamental capability improvement. The expensive stage turns out to be doing style work, not knowledge work. Do the style work cheaply.
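The “lightweight DPO pass” mentioned above has a loss simple enough to sketch per preference pair (Rafailov et al., 2023). The log-prob inputs and the beta value here are illustrative:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Per-pair Direct Preference Optimization loss on summed log-probs:
    pi_* come from the policy being tuned, ref_* from the frozen reference
    (here, the LIMA-style SFT model). beta controls how far the policy may
    drift from the reference."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)
```

No reward model and no PPO rollouts: the preference data is consumed directly, which is why it pairs naturally with a cheap LIMA-style SFT stage.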
If your pretrained model already knows the answer, 1,000 well-chosen examples are enough to teach it how to speak.
Connections
- sft — LIMA demonstrates SFT with minimal data achieves strong alignment
- alignment — the Superficial Alignment Hypothesis reframes what alignment fine-tuning accomplishes
- instruction-following — LIMA shows instruction-following style is learnable from 1,000 diverse examples
- data-quality — the central empirical finding: data quality dominates data quantity
- llama-open-efficient-foundation-language-models — the pretrained base model LIMA fine-tunes
- training-language-models-to-follow-instructions-with-human-feedback — the InstructGPT RLHF pipeline that LIMA challenges
- llama-2-open-foundation-fine-tuned-chat-models — Meta’s follow-up: more careful SFT curation combined with targeted RLHF
- qlora-efficient-finetuning-quantized-llms — makes LIMA-style fine-tuning accessible on consumer hardware
Citation
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., & Levy, O. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023. https://arxiv.org/abs/2305.11206