What It Is

Supervised Fine-Tuning (SFT) is the process of further training a pretrained language model on a curated dataset of high-quality (prompt, response) pairs using standard next-token prediction loss. It transforms a base model that predicts text continuations into an instruction-following assistant — the minimum viable step between a raw pretrained model and something deployable.

Why It Matters

SFT is the first and most accessible alignment technique. A pretrained model on its own will complete prompts in a statistical sense — given “Summarize this article:”, it may output more article text, not a summary. SFT teaches the model what “following an instruction” looks like. More critically, SFT is the prerequisite for every subsequent alignment method: RLHF requires an SFT model as the starting policy, DPO uses the SFT model to define the reference distribution. The quality of the SFT model sets the ceiling for what RLHF/DPO can recover.

How It Works

Training Mechanics

Given a dataset of (prompt, response) pairs, SFT computes cross-entropy loss only over the response tokens — prompt tokens are masked and do not contribute to the gradient. This is crucial: you’re teaching the model to produce the response given the prompt, not to memorize the prompt itself.

Input:   [PROMPT] "Explain backpropagation in one sentence."
         [RESPONSE] "Backpropagation computes gradients..."

Loss computed on:  ████████████████████████████ (response only)
Prompt masked:     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (no gradient)
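In code, this masking is usually implemented by setting the prompt positions in the label sequence to an ignore index that the loss function skips. A minimal sketch, assuming the common -100 convention used by most framework cross-entropy implementations (function name and token ids are illustrative):

```python
IGNORE_INDEX = -100  # label value that standard cross-entropy losses skip

def build_labels(input_ids, prompt_len):
    """Copy input ids as labels, masking the prompt so only response
    tokens contribute to the gradient."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: 4 prompt tokens followed by 3 response tokens
ids = [101, 7592, 2088, 102, 3437, 2003, 102]
labels = build_labels(ids, prompt_len=4)
# → [-100, -100, -100, -100, 3437, 2003, 102]
```

The model still attends to the prompt tokens as context; they are only excluded from the loss, not from the forward pass.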

The loss is the average negative log-likelihood over the response tokens:

L(θ) = −(1/T) Σ_{i=1}^{T} log P_θ(y_i | x, y_{<i})

Where x is the prompt, y_i is the i-th response token, T is the number of response tokens, and P_θ is the model’s probability under current weights θ.
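As a worked check, the averaged negative log-likelihood can be sketched directly; the per-token probabilities below are hypothetical stand-ins for what P_θ would assign:

```python
import math

def sft_loss(token_probs):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct
    response token (hypothetical values for illustration).
    """
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T

# Perfect predictions give zero loss; uncertain ones give log(1/p)
print(sft_loss([1.0, 1.0, 1.0]))  # → 0.0
print(sft_loss([0.5, 0.5]))       # → log(2) ≈ 0.693
```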

Instruction Format Design

How the prompt and response are delimited matters for generalization. Common formats:

# Alpaca-style
### Instruction:
{prompt}

### Response:
{response}

# ChatML-style (OpenAI, LLaMA-3)
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>

# Simple delimiter
Human: {prompt}
Assistant: {response}
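Whichever format is chosen, training and inference must apply it identically. A hypothetical helper for the ChatML-style layout above (the function name is illustrative; libraries such as Hugging Face provide chat-templating utilities for this):

```python
def format_chatml(prompt, response):
    """Wrap a (prompt, response) pair in ChatML-style delimiters,
    matching the layout used at both training and inference time."""
    return (f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n{response}<|im_end|>")

example = format_chatml("Explain backpropagation in one sentence.",
                        "Backpropagation computes gradients...")
```

At inference, the same template is applied up through `<|im_start|>assistant\n`, and the model generates until it emits `<|im_end|>`.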

The choice of format becomes baked into the model — you must use the same format at inference time. Models trained with ChatML format will “leak” special tokens into generations if prompted differently.

Hyperparameter Strategy

| Parameter     | Typical Value     | Rationale                                  |
|---------------|-------------------|--------------------------------------------|
| Learning rate | 1e-5 to 2e-5      | 10-100× smaller than the pretraining LR    |
| Epochs        | 1-3               | More epochs → catastrophic forgetting      |
| Warmup        | 3-5% of steps     | Stabilizes early training                  |
| Batch size    | 128-256 sequences | Smaller than typical pretraining batches   |
| LR schedule   | Cosine decay      | Matches pretraining practice               |

The small learning rate is not caution — it’s necessity. The pretrained model already encodes world knowledge in its weights; aggressive updates destroy that knowledge before the model has time to learn the new behavior. This is catastrophic forgetting: the model overwrites pretrained weight configurations faster than SFT can reinforce the new ones.
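The warmup-plus-cosine schedule from the table can be sketched as a single function; the peak LR and warmup fraction below are illustrative defaults, not prescriptions:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

Libraries such as Hugging Face Transformers ship an equivalent scheduler (`get_cosine_schedule_with_warmup`), so in practice this is configuration rather than code.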

Data Quality vs. Quantity

LIMA (2023) showed that 1,000 carefully selected examples can match models trained on 50,000+ noisy ones. The signal is in the diversity and quality of the demonstrations, not the count. Key properties of high-quality SFT data:

  • Diversity across tasks, domains, and response styles
  • Correctness — wrong demonstrations teach wrong behaviors
  • Format consistency — inconsistent delimiters confuse the model
  • Response length calibration — if all demonstrations are 200 tokens, the model learns to stop at 200 tokens regardless of task
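A toy filter illustrating the kind of heuristic checks a curation pipeline might start with; the thresholds are illustrative, and real pipelines use far richer signals (model-based scoring, deduplication, human review):

```python
def basic_quality_checks(example):
    """Toy heuristics for screening (prompt, response) pairs."""
    prompt, response = example["prompt"], example["response"]
    if not prompt.strip() or not response.strip():
        return False                                   # empty fields
    if len(response.split()) < 3:
        return False                                   # trivially short answer
    if response.strip().lower() == prompt.strip().lower():
        return False                                   # response echoes prompt
    return True
```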

InstructGPT (2022) used 13,000 demonstrations from 40 human contractors — small by dataset standards but carefully curated for quality and task coverage.

What SFT Changes vs. Pretraining

Pretraining gives the model knowledge and language skill. SFT gives it behavior. Specifically:

  • Response format: The model learns to end responses (vs. continuing forever)
  • Instruction interpretation: “Summarize” vs. “Translate” vs. “Write code” now map to distinct behaviors
  • Tone calibration: Helpful, direct, appropriately hedged
  • Task switching: A single model can handle classification, QA, summarization without prompt engineering tricks

What SFT does not change much: factual knowledge, reasoning depth, or capability limits. SFT can’t teach a model to do math it couldn’t do in pretraining — it can only teach it to try.

Why SFT Alone Isn’t Enough

SFT is fundamentally limited by its data. The model learns to imitate the distribution of demonstrations — if your demonstrations don’t cover a scenario, the model will generalize from the closest training examples, which may be wrong. Key gaps:

  1. No preference signal: SFT treats every demonstration as equally correct. It can’t distinguish a good summary from a great one; it only learns to imitate whatever was demonstrated.
  2. Coverage limits: You can’t write demonstrations for every possible prompt. The long tail of user requests will be handled by noisy generalization.
  3. No safety signal: A model trained only on helpful demonstrations hasn’t learned to refuse anything.

This is the bridge to RLHF: human preference rankings don’t require writing the ideal response, just ranking K candidates. This scales much more efficiently and captures relative quality rather than binary correct/incorrect.

Catastrophic Forgetting Risk

Fine-tuning on a narrow task distribution causes the model to forget pretrained knowledge — weights shift to minimize SFT loss, disrupting representations learned during pretraining. Mitigation strategies:

  • Low learning rate + few epochs — the standard approach
  • LoRA — freeze base weights, train only low-rank adapters; largely prevents forgetting because the base weights themselves never change
  • Data mixing — include a small fraction of pretraining data in the SFT mix (InstructGPT’s PPO-ptx variant)
  • Regularization — EWC (Elastic Weight Consolidation) penalizes updates to weights that mattered in pretraining
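The data-mixing strategy can be sketched as a sampler that blends a small pretraining slice into each SFT batch; the 10% fraction and function names are illustrative, not taken from any specific recipe:

```python
import random

def mixed_batch(sft_pool, pretrain_pool, batch_size,
                pretrain_frac=0.1, seed=0):
    """Sample a batch that is mostly SFT examples with a small
    pretraining fraction mixed in to anchor the old distribution."""
    rng = random.Random(seed)
    n_pre = int(batch_size * pretrain_frac)
    batch = rng.sample(pretrain_pool, n_pre)
    batch += rng.sample(sft_pool, batch_size - n_pre)
    rng.shuffle(batch)
    return batch
```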

Key Sources

  • rlhf — RLHF’s first stage is SFT; quality of SFT determines RLHF ceiling
  • dpo — DPO builds on an SFT model as its reference policy; loss is relative to SFT behavior
  • lora — efficient alternative to full-weight SFT; prevents catastrophic forgetting
  • reward-model — initialized from the SFT model with a scalar head replacing the final layer
  • alignment — SFT is the most direct but least powerful alignment technique

Open Questions

  • How much SFT data is needed before RLHF/DPO provides diminishing returns?
  • Can SFT alone (with very high-quality data at large scale) match RLHF-aligned models on safety?
  • What is the optimal data mixture for SFT across tasks, domains, and difficulty levels?
  • Does the instruction format choice (ChatML vs. Alpaca) have lasting effects on model behavior?