Concepts: sft | alignment | instruction-following | data-quality Builds on: llama-open-efficient-foundation-language-models | training-language-models-to-follow-instructions-with-human-feedback Leads to: llama-2-open-foundation-fine-tuned-chat-models | qlora-efficient-finetuning-quantized-llms
The standard recipe for a usable LLM in 2023 went like this: pretrain on a trillion tokens, then fine-tune on tens of thousands of instruction-response pairs, then run weeks of PPO against a reward model trained from human preference labels. Each stage was expensive, fragile, and guarded as secret sauce. The RLHF pipeline alone required maintaining four models simultaneously and a careful human labeling operation.
LIMA asks: what if almost all of that is unnecessary?
The core idea
The analogy: You want to learn academic writing. You could study 52,000 mediocre student essays — you’d absorb average conventions, common mistakes, the most-used hedging phrases. Or you could study 1,000 papers selected by an expert for being exceptionally clear, precise, and well-structured. The second approach teaches the form of good writing far more effectively — because that’s all you need. The subject knowledge came from reading widely (pretraining). What you need to learn is how to express that knowledge in the expected format.
That’s LIMA’s core claim: pretraining is where a model learns everything it knows. Fine-tuning is just teaching it how to talk.
The formal name for this claim is the Superficial Alignment Hypothesis:
“almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output”
If this is right, the exhausting RLHF pipeline is doing far less work than anyone assumed. The model’s capabilities were fixed at the end of pretraining. What RLHF adds is the style and tone of helpful responses — and style can be learned from a small, well-curated dataset.
The mechanism, step by step
Step 1: Start with a strong pretrained base.
LIMA uses LLaMA-65B. The model size matters: the hypothesis only holds if the base model already contains rich world knowledge from pretraining. A 7B model fine-tuned on the same 1,000 examples produces noticeably weaker results. The “pretraining does the work” argument requires pretraining to have actually done the work.
Step 2: Curate 1,000 examples with obsessive quality control.
The 1,000 training examples come from sources selected for response quality, diversity, and format variety — not volume:
DATA COMPOSITION: 1,000 total examples
===============================================
Stack Exchange       ~400 Top-voted answers (200 STEM, 200 other topics)
wikiHow              ~200 High-quality instructional step-by-step guides
Pushshift Reddit     ~150 Top posts from carefully selected subreddits
Natural Instructions  ~50 Examples sampled from the benchmark's tasks
Manually authored    ~200 Written by the paper's authors directly
EXPLICITLY EXCLUDED:
Alpaca-style GPT-3.5 self-instruct outputs (quantity without quality)
Unfiltered web scrapes (noisy, contradictory)
Single-domain focused datasets (kills generalization)
===============================================
Each example passed a quality bar: accurate, well-formatted, and covering a task type not already dominated in the dataset. The authors manually wrote 200 training examples themselves — ensuring a fifth of the data met their exact standard.
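A minimal sketch of such a quality-and-diversity filter, with a hypothetical `quality` score and `task_type` label standing in for the paper's manual human judgments:

```python
from collections import Counter

def curate(candidates, per_type_cap=50, min_quality=0.8):
    """Illustrative curation pass (hypothetical scoring, not the paper's
    actual tooling): keep only high-quality responses, and cap each task
    type so no single format dominates the final mix."""
    kept, per_type = [], Counter()
    for ex in sorted(candidates, key=lambda e: e["quality"], reverse=True):
        if ex["quality"] < min_quality:
            continue                       # fails the quality bar
        if per_type[ex["task_type"]] >= per_type_cap:
            continue                       # this format is already well covered
        per_type[ex["task_type"]] += 1
        kept.append(ex)
    return kept
```

The two checks mirror the paper's two selection axes: an absolute quality bar, and a diversity constraint that rejects otherwise-good examples whose format is already saturated.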
Step 3: Standard supervised fine-tuning, nothing fancy.
No RLHF. No PPO. No reward model. Standard cross-entropy loss over response tokens, with prompt tokens masked:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$$

where $x$ is the prompt, $y_t$ is the $t$-th response token, and $y_{<t}$ are all preceding response tokens. The prompt tokens are masked — they contribute zero gradient. The model learns only to produce the response given the context.
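The masking can be sketched in a few lines of plain Python; this is a toy that takes raw per-position vocabulary scores rather than a real model's tensors:

```python
import math

def masked_sft_loss(logits, token_ids, prompt_len):
    """Mean negative log-likelihood over response tokens only.
    logits[t] scores the prediction of token t+1; positions whose
    target falls inside the prompt are masked out of the loss."""
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if t + 1 < prompt_len:          # target is a prompt token: skip it
            continue
        scores = logits[t]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[token_ids[t + 1]]   # -log p(y_{t+1})
        count += 1
    return total / count
```

In a framework like PyTorch the same effect is usually achieved by setting prompt positions' labels to an ignore index so they contribute nothing to the gradient.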
Training configuration: 15 epochs, cosine learning rate decay, batch size 64, dropout 0.1 on residual connections, linear warmup over 32 steps. 1,000 examples × 15 epochs is 15,000 example passes, which at batch size 64 is only about 240 gradient steps — roughly 1-2 GPU-hours on an A100. Compared to RLHF’s weeks, this is rounding error.
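For scale, the step arithmetic and schedule can be checked directly; 15,000 example passes translate to roughly 240 optimizer steps at batch size 64. The peak learning rate of 1e-5 and decay to zero are illustrative assumptions, not values pinned down here:

```python
import math

EXAMPLES, EPOCHS, BATCH = 1_000, 15, 64
TOTAL_STEPS = math.ceil(EXAMPLES / BATCH) * EPOCHS   # 16 steps/epoch * 15

def lr_at(step, total_steps=TOTAL_STEPS, base_lr=1e-5, warmup=32):
    """Linear warmup, then cosine decay to zero over the remaining steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Note how short the run is: the 32-step warmup alone covers more than a tenth of all training.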
TRAINING COMPARISON
===========================================
Alpaca (LLaMA 65B fine-tune):
Data: 52,000 GPT-3.5 self-instruct outputs
Time: Multi-GPU days
Result: LIMA beats it in 79% of pairwise comparisons
LIMA:
Data: 1,000 curated examples (~10 MB of text)
Time: ~2 GPU-hours on A100
Result: Equal or preferred vs DaVinci003 (RLHF) in 65% of cases
InstructGPT / RLHF pipeline:
Data: 13K SFT demos + 33K pref comparisons + RL training
Time: Weeks, multiple model training stages
Result: Only GPT-4-tier output still beats LIMA (preferred in 57%)
===========================================
Find the instinct: why does quality beat quantity here?
The key question: why would 1,000 curated examples outperform 52,000 generated ones?
The answer is format diversity. Alpaca generates all 52,000 examples through the same GPT-3.5 prompt template — the responses cluster around a specific style, length distribution, and phrasing pattern. LIMA draws from five sources with genuinely different formats: Stack Exchange answers look like technical explanations, wikiHow looks like numbered guides, Reddit looks like conversational advice. This format diversity teaches the model that there is no single “correct response style” — the right format depends on what the question calls for. A model learns this generalization better from 1,000 diverse examples than from 52,000 homogeneous ones.
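One cheap way to see the homogeneity problem is a toy diagnostic over response lengths. This is an illustration of the idea, not anything the paper computes:

```python
import statistics

def length_spread(responses):
    """Toy homogeneity diagnostic: coefficient of variation of response
    lengths in words. Values near zero suggest one dominant template; a
    real check would also look at structure and phrasing, not just length."""
    lengths = [len(r.split()) for r in responses]
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

A corpus generated through one prompt template scores near zero on this kind of metric; a mix of terse answers, numbered guides, and long explanations scores much higher.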
The deeper implication: RLHF’s complexity may be solving a measurement problem (how do you operationalize “helpful”?) more than a capability problem. If you can show the model what helpful looks like — with genuine diversity and quality — it generalizes without needing the reward signal.
Worked example: what actually changes with curation
Let’s trace the impact of quality selection on a single training example.
Low-quality SFT example (Alpaca-style):
Prompt: "Explain photosynthesis"
Response: "Photosynthesis is the process by which plants use sunlight,
water and CO2 to produce oxygen and energy in the form of
sugar. It is very important for life on earth."
This is correct but thin. No structure, no mechanism. A model trained on 52,000 examples like this learns to produce brief, surface-level summaries.
High-quality SFT example (LIMA-style, from Stack Exchange):
Prompt: "Explain photosynthesis"
Response: "Photosynthesis happens in two stages:
1. Light-dependent reactions (in the thylakoid membrane):
Chlorophyll absorbs photons, splitting water (H₂O) to
release O₂ and store energy in ATP and NADPH.
2. Calvin cycle (in the stroma):
ATP and NADPH drive the fixation of CO₂ into G3P,
the precursor for glucose synthesis.
Net equation: 6CO₂ + 6H₂O + light → C₆H₁₂O₆ + 6O₂
The key bottleneck is usually RuBisCO, the enzyme that
fixes CO₂ — it's surprisingly slow and occasionally fixes
O₂ instead (photorespiration), which wastes energy."
Trained on 1,000 like this, the model learns: use structure, give mechanisms, include the equation, note the interesting edge case. It generalizes this style of thoroughness to other domains — even ones not in the training set.
Results
Human evaluation (pairwise preference study):
| Comparison | LIMA win | Tie | Baseline win | LIMA total (win+tie) |
|---|---|---|---|---|
| vs GPT-4 | 19% | 24% | 57% | 43% |
| vs Bard | 34% | 24% | 42% | 58% |
| vs DaVinci003 (RLHF) | 50% | 15% | 35% | 65% |
| vs Alpaca 65B | 79% | 11% | 10% | 90% |
| vs Claude | 30% | 21% | 49% | 51% |
The DaVinci003 row is the most striking: text-davinci-003 was trained with human feedback on tens of thousands of comparisons. LIMA’s responses are judged equal or better in 65% of pairwise comparisons, with 1,000 SFT examples and no reward model.
Ablation: does more data help?
The paper runs LIMA variants with 2,000 and 4,000 examples. Performance improves marginally. But mixing in lower-quality data — even when it increases dataset size — hurts performance. This is the strongest evidence for quality > quantity: adding noise actively degrades the model, even when the noisy examples are plausibly correct.
“the model tends to generalize well to unseen tasks that did not appear in the training data”
This is the result that validates the Superficial Alignment Hypothesis most directly: if fine-tuning were teaching task-specific knowledge, generalization to unseen task types would be poor. Instead, LIMA generalizes broadly — because the base model already knew, and fine-tuning taught it to communicate.
What breaks
Base model dependency. The hypothesis holds for LLaMA-65B specifically. The same 1,000-example fine-tune on a 7B model gives measurably weaker results. “Pretraining does the work” only holds if pretraining was comprehensive. Smaller models may not have internalized enough knowledge to generalize from sparse SFT signals.
Safety is unaddressed. LIMA does no safety alignment. The model can be prompted to produce harmful content at rates comparable to the base LLaMA-65B. The RLHF pipeline, whatever its capability contribution, does real safety work that 1,000 curated helpful examples do not cover.
Single-turn evaluation. The human study is predominantly single-turn. Multi-turn conversation — maintaining coherence, tracking context across exchanges, graceful clarification requests — is where RLHF-trained models show larger advantages. LIMA’s multi-turn quality degrades faster than GPT-4’s as conversations extend.
Production exposure gap. The evaluation uses a curated test set. Real deployment surfaces adversarial inputs, ambiguous requests, and off-distribution queries that no 1,000-example dataset captures by design.
Practitioner notes
If you’re building ML systems, LIMA changes the calculus on alignment data collection. Before running an expensive labeling pipeline, ask: is my base model strong enough that the “pretraining did the work” assumption holds? If the base model is 13B+ with quality pretraining data, a small but carefully curated instruction set (500-2,000 examples) will outperform a large but noisy one.
Concretely: if you’re fine-tuning for a specific domain (legal analysis, medical Q&A, code review), gather 500-1,000 examples that are genuinely excellent responses in that domain. Have domain experts select or write them. Don’t generate them with GPT-4 in a self-instruct loop — that produces GPT-4’s style and introduces homogeneity that kills diversity.
What you don’t get from LIMA alone: safety robustness, strong multi-turn performance, or hardened handling of adversarial prompts. The practical synthesis post-LIMA: run small, curated SFT first (LIMA-style), then apply DPO or a lightweight RLHF pass for safety and robustness — not a massive RLHF pipeline that assumes the base model needs fundamental capability improvement. The expensive stage turns out to be doing style work, not knowledge work. Do the style work cheaply.
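The “lightweight DPO pass” mentioned above has a loss simple enough to sketch per preference pair (Rafailov et al., 2023). The log-prob inputs and the beta value here are illustrative:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Per-pair Direct Preference Optimization loss on summed log-probs:
    pi_* come from the policy being tuned, ref_* from the frozen reference
    (here, the LIMA-style SFT model). beta controls how far the policy may
    drift from the reference."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)
```

No reward model and no PPO rollouts: the preference data is consumed directly, which is why it pairs naturally with a cheap LIMA-style SFT stage.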
If your pretrained model already knows the answer, 1,000 well-chosen examples are enough to teach it how to speak.
Connections
- sft — LIMA demonstrates SFT with minimal data achieves strong alignment
- alignment — the Superficial Alignment Hypothesis reframes what alignment fine-tuning accomplishes
- instruction-following — LIMA shows instruction-following style is learnable from 1,000 diverse examples
- data-quality — the central empirical finding: data quality dominates data quantity
- llama-open-efficient-foundation-language-models — the pretrained base model LIMA fine-tunes
- training-language-models-to-follow-instructions-with-human-feedback — the InstructGPT RLHF pipeline that LIMA challenges
- llama-2-open-foundation-fine-tuned-chat-models — Meta’s follow-up: more careful SFT curation combined with targeted RLHF
- qlora-efficient-finetuning-quantized-llms — makes LIMA-style fine-tuning accessible on consumer hardware
Citation
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., & Levy, O. (2023). LIMA: Less Is More for Alignment. NeurIPS 2023. https://arxiv.org/abs/2305.11206