What It Is

Supervised Fine-Tuning (SFT) is the process of further training a pretrained language model on a curated dataset of high-quality (prompt, response) pairs using standard next-token prediction loss. It transforms a base model that predicts text continuations into an instruction-following assistant — the minimum viable step between a raw pretrained model and something deployable.

Why It Matters

SFT is the first and most accessible alignment technique. A pretrained model on its own will complete prompts in a statistical sense — given “Summarize this article:”, it may output more article text, not a summary. SFT teaches the model what “following an instruction” looks like. More critically, SFT is the prerequisite for every subsequent alignment method: RLHF requires an SFT model as the starting policy, DPO uses the SFT model to define the reference distribution. The quality of the SFT model sets the ceiling for what RLHF/DPO can recover.

How It Works

Training Mechanics

Given a dataset of (prompt, response) pairs, SFT computes cross-entropy loss only over the response tokens — prompt tokens are masked and do not contribute to the gradient. This is crucial: you’re teaching the model to produce the response given the prompt, not to memorize the prompt itself.

Input:   [PROMPT] "Explain backpropagation in one sentence."
         [RESPONSE] "Backpropagation computes gradients..."

Loss computed on:  ████████████████████████████ (response only)
Prompt masked:     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░ (no gradient)
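In code, this masking is usually implemented by setting the prompt positions in the label sequence to an ignore index that the loss function skips. A minimal sketch, assuming the common -100 convention used by most framework cross-entropy implementations (function name and token ids are illustrative):

```python
IGNORE_INDEX = -100  # label value that standard cross-entropy losses skip

def build_labels(input_ids, prompt_len):
    """Copy input ids as labels, masking the prompt so only response
    tokens contribute to the gradient."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: 4 prompt tokens followed by 3 response tokens
ids = [101, 7592, 2088, 102, 3437, 2003, 102]
labels = build_labels(ids, prompt_len=4)
# → [-100, -100, -100, -100, 3437, 2003, 102]
```

The model still attends to the prompt tokens as context; they are only excluded from the loss, not from the forward pass.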

The loss is the average negative log-likelihood over the response tokens:

L(θ) = −(1/T) Σ_{i=1}^{T} log P_θ(y_i | x, y_{<i})

Where x is the prompt, y_i is the i-th response token, T is the number of response tokens, and P_θ is the model’s probability under current weights θ.
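As a worked check, the averaged negative log-likelihood can be sketched directly; the per-token probabilities below are hypothetical stand-ins for what P_θ would assign:

```python
import math

def sft_loss(token_probs):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct
    response token (hypothetical values for illustration).
    """
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T

# Perfect predictions give zero loss; uncertain ones give log(1/p)
print(sft_loss([1.0, 1.0, 1.0]))  # → 0.0
print(sft_loss([0.5, 0.5]))       # → log(2) ≈ 0.693
```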

Instruction Format Design

How the prompt and response are delimited matters for generalization. Common formats:

# Alpaca-style
### Instruction:
{prompt}

### Response:
{response}

# ChatML-style (OpenAI, LLaMA-3)
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>

# Simple delimiter
Human: {prompt}
Assistant: {response}
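Whichever format is chosen, training and inference must apply it identically. A hypothetical helper for the ChatML-style layout above (the function name is illustrative; libraries such as Hugging Face provide chat-templating utilities for this):

```python
def format_chatml(prompt, response):
    """Wrap a (prompt, response) pair in ChatML-style delimiters,
    matching the layout used at both training and inference time."""
    return (f"<|im_start|>user\n{prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n{response}<|im_end|>")

example = format_chatml("Explain backpropagation in one sentence.",
                        "Backpropagation computes gradients...")
```

At inference, the same template is applied up through `<|im_start|>assistant\n`, and the model generates until it emits `<|im_end|>`.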

The choice of format becomes baked into the model — you must use the same format at inference time. Models trained with ChatML format will “leak” special tokens into generations if prompted differently.

Hyperparameter Strategy

| Parameter     | Typical Value     | Rationale                                  |
|---------------|-------------------|--------------------------------------------|
| Learning rate | 1e-5 to 2e-5      | 10-100× smaller than the pretraining LR    |
| Epochs        | 1-3               | More epochs → catastrophic forgetting      |
| Warmup        | 3-5% of steps     | Stabilizes early training                  |
| Batch size    | 128-256 sequences | Smaller than typical pretraining batches   |
| LR schedule   | Cosine decay      | Matches pretraining practice               |

The small learning rate is not caution — it’s necessity. The pretrained model already encodes world knowledge in its weights; aggressive updates destroy that knowledge before the model has time to learn the new behavior. This is catastrophic forgetting: the model overwrites pretrained weight configurations faster than SFT can reinforce the new ones.
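The warmup-plus-cosine schedule from the table can be sketched as a single function; the peak LR and warmup fraction below are illustrative defaults, not prescriptions:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.05):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

Libraries such as Hugging Face Transformers ship an equivalent scheduler (`get_cosine_schedule_with_warmup`), so in practice this is configuration rather than code.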

Data Quality vs. Quantity

LIMA (2023) showed that 1,000 carefully selected examples can match models trained on 50,000+ noisy ones. The signal is in the diversity and quality of the demonstrations, not the count. Key properties of high-quality SFT data:

  • Diversity across tasks, domains, and response styles
  • Correctness — wrong demonstrations teach wrong behaviors
  • Format consistency — inconsistent delimiters confuse the model
  • Response length calibration — if all demonstrations are 200 tokens, the model learns to stop at 200 tokens regardless of task
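A toy filter illustrating the kind of heuristic checks a curation pipeline might start with; the thresholds are illustrative, and real pipelines use far richer signals (model-based scoring, deduplication, human review):

```python
def basic_quality_checks(example):
    """Toy heuristics for screening (prompt, response) pairs."""
    prompt, response = example["prompt"], example["response"]
    if not prompt.strip() or not response.strip():
        return False                                   # empty fields
    if len(response.split()) < 3:
        return False                                   # trivially short answer
    if response.strip().lower() == prompt.strip().lower():
        return False                                   # response echoes prompt
    return True
```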

InstructGPT (2022) used 13,000 demonstrations from 40 human contractors — small by dataset standards but carefully curated for quality and task coverage.

What SFT Changes vs. Pretraining

Pretraining gives the model knowledge and language skill. SFT gives it behavior. Specifically:

  • Response format: The model learns to end responses (vs. continuing forever)
  • Instruction interpretation: “Summarize” vs. “Translate” vs. “Write code” now map to distinct behaviors
  • Tone calibration: Helpful, direct, appropriately hedged
  • Task switching: A single model can handle classification, QA, summarization without prompt engineering tricks

What SFT does not change much: factual knowledge, reasoning depth, or capability limits. SFT can’t teach a model to do math it couldn’t do in pretraining — it can only teach it to try.

Why SFT Alone Isn’t Enough

SFT is fundamentally limited by its data. The model learns to imitate the distribution of demonstrations — if your demonstrations don’t cover a scenario, the model will generalize from the closest training examples, which may be wrong. Key gaps:

  1. No preference signal: SFT treats every demonstration as equally correct. It can’t distinguish a good summary from a great one; it only learns to imitate whatever was demonstrated.
  2. Coverage limits: You can’t write demonstrations for every possible prompt. The long tail of user requests will be handled by noisy generalization.
  3. No safety signal: A model trained only on helpful demonstrations hasn’t learned to refuse anything.

This is the bridge to RLHF: human preference rankings don’t require writing the ideal response, just ranking K candidates. This scales much more efficiently and captures relative quality rather than binary correct/incorrect.

Catastrophic Forgetting Risk

Fine-tuning on a narrow task distribution causes the model to forget pretrained knowledge — weights shift to minimize SFT loss, disrupting representations learned during pretraining. Mitigation strategies:

  • Low learning rate + few epochs — the standard approach
  • LoRA — freeze base weights, train only low-rank adapters; largely prevents forgetting because the base weights themselves never change
  • Data mixing — include a small fraction of pretraining data in the SFT mix (InstructGPT’s PPO-ptx variant)
  • Regularization — EWC (Elastic Weight Consolidation) penalizes updates to weights that mattered in pretraining
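The data-mixing strategy can be sketched as a sampler that blends a small pretraining slice into each SFT batch; the 10% fraction and function names are illustrative, not taken from any specific recipe:

```python
import random

def mixed_batch(sft_pool, pretrain_pool, batch_size,
                pretrain_frac=0.1, seed=0):
    """Sample a batch that is mostly SFT examples with a small
    pretraining fraction mixed in to anchor the old distribution."""
    rng = random.Random(seed)
    n_pre = int(batch_size * pretrain_frac)
    batch = rng.sample(pretrain_pool, n_pre)
    batch += rng.sample(sft_pool, batch_size - n_pre)
    rng.shuffle(batch)
    return batch
```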

Key Sources

  • rlhf — RLHF’s first stage is SFT; quality of SFT determines RLHF ceiling
  • dpo — DPO builds on an SFT model as its reference policy; loss is relative to SFT behavior
  • lora — efficient alternative to full-weight SFT; prevents catastrophic forgetting
  • reward-model — initialized from the SFT model with a scalar head replacing the final layer
  • alignment — SFT is the most direct but least powerful alignment technique

Open Questions

  • How much SFT data is needed before RLHF/DPO provides diminishing returns?
  • Can SFT alone (with very high-quality data at large scale) match RLHF-aligned models on safety?
  • What is the optimal data mixture for SFT across tasks, domains, and difficulty levels?
  • Does the instruction format choice (ChatML vs. Alpaca) have lasting effects on model behavior?