What It Is

Knowledge distillation is a training technique where a smaller “student” model is trained to match the outputs of a larger “teacher” model, rather than training from scratch on ground-truth labels. The student learns to approximate the teacher’s behavior at a fraction of the compute and parameter cost.

Why It Matters

Distillation is the primary mechanism for creating efficient, deployable models from frontier-scale teachers. It is also increasingly important for transferring reasoning: the DeepSeek-R1 release included smaller models distilled from R1 itself, showing that reasoning behavior can be passed down to much smaller students. Many open-source instruction-following models are effectively distillations of OpenAI models via synthetic data (e.g., Alpaca, trained on text-davinci-003 outputs; Vicuna, trained on shared ChatGPT conversations).

How It Works

Classic distillation (Hinton et al., 2015) trains the student to minimize the KL divergence between its output distribution and the teacher’s “soft” probability distribution — a softmax taken at an elevated temperature rather than one-hot labels. The soft labels carry more information than hard labels: they encode the teacher’s uncertainty and inter-class relationships (e.g., that an image of a “7” looks more like a “1” than like a “dog”).
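A minimal pure-Python sketch of the classic soft-label loss (function names are illustrative, not from any library). Following Hinton et al., both distributions are softened with the same temperature T, and the loss is scaled by T² so gradient magnitudes stay comparable across temperatures:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label distillation loss over one output distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, p_student)
```

When the student matches the teacher exactly the loss is zero; raising the temperature spreads probability mass onto non-argmax classes, which is precisely the inter-class information the student learns from.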

In LLM contexts, distillation often takes the form of data distillation: generating synthetic training data from a teacher model and SFT-ing the student on it. This is simpler than output-distribution matching but effective in practice.
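A sketch of the data-distillation pipeline, assuming a hypothetical `teacher_generate` callable that wraps the teacher model’s sampling API. The light filtering (dropping empty and duplicate completions) mirrors the cleanup steps such pipelines typically apply before SFT:

```python
def distill_dataset(prompts, teacher_generate, min_words=1):
    """Build an SFT dataset by labeling prompts with teacher completions.

    `teacher_generate(prompt) -> str` is a stand-in for the real teacher API.
    The student is then fine-tuned on the returned (prompt, completion)
    pairs with the ordinary next-token cross-entropy loss.
    """
    dataset, seen = [], set()
    for prompt in prompts:
        completion = teacher_generate(prompt)
        # Drop empty or duplicate outputs before fine-tuning.
        if len(completion.split()) < min_words or completion in seen:
            continue
        seen.add(completion)
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

Because the student only ever sees teacher-sampled text (not the teacher’s full token distributions), this transfers surface behavior more cheaply than soft-label matching, at the cost of discarding the teacher’s per-token uncertainty.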

Key variants:

  • Soft label distillation — match teacher token probabilities directly
  • Data distillation — SFT on teacher-generated outputs
  • Speculative decoding — smaller draft model approximates large model token-by-token (an inference-time analog)
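The speculative-decoding variant can be sketched as a toy acceptance loop (all four callables are hypothetical stand-ins for real model APIs, and rejection resampling is elided). Each draft token is accepted with probability min(1, p_target/p_draft), which leaves the target model’s output distribution unchanged:

```python
import random

def speculative_step(draft_sample, draft_prob, target_prob, context, k=4, rng=random):
    """One round of speculative decoding: the draft model proposes up to k
    tokens; the target model verifies them (in a real system, in parallel).

    draft_sample(context) -> token
    draft_prob(context, token) -> float   # draft model's probability
    target_prob(context, token) -> float  # target model's probability
    """
    accepted = []
    for _ in range(k):
        token = draft_sample(context + accepted)
        p_draft = draft_prob(context + accepted, token)
        p_target = target_prob(context + accepted, token)
        # Accept with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(token)
        else:
            break  # on rejection, resample from the adjusted target distribution
    return accepted
```

When the draft closely approximates the target (the distillation connection), most proposals are accepted and the target model validates several tokens per forward pass instead of one.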

Key Sources

  • sft — Data distillation is implemented as SFT on teacher outputs
  • speculative-decoding — Uses a small draft model in an inference-time distillation-like role

Open Questions

  • What is the capability ceiling for distilled models relative to their teachers?
  • Can distillation transfer reasoning capabilities (not just surface behavior) from large to small models?
  • How does distillation interact with RLHF alignment — does it preserve or dilute alignment?