What It Is

Knowledge distillation is a training technique where a smaller “student” model is trained to match the outputs of a larger “teacher” model, rather than training from scratch on ground-truth labels. The student learns to approximate the teacher’s behavior at a fraction of the compute and parameter cost.

Why It Matters

Distillation is the primary mechanism for creating efficient, deployable models from frontier-scale teachers. It is also increasingly important for transferring reasoning: the DeepSeek-R1 release included smaller models distilled from R1 itself, showing that reasoning behavior can be passed down to much smaller students. Many open-source instruction-following models are effectively distillations of OpenAI models via synthetic data (e.g., Alpaca, trained on text-davinci-003 outputs; Vicuna, trained on shared ChatGPT conversations).

How It Works

Classic distillation (Hinton et al., 2015) trains the student to minimize the KL divergence between its output distribution and the teacher’s “soft” probability distribution — a softmax taken at an elevated temperature rather than one-hot labels. The soft labels carry more information than hard labels: they encode the teacher’s uncertainty and inter-class relationships (e.g., that an image of a “7” looks more like a “1” than like a “dog”).
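A minimal pure-Python sketch of the classic soft-label loss (function names are illustrative, not from any library). Following Hinton et al., both distributions are softened with the same temperature T, and the loss is scaled by T² so gradient magnitudes stay comparable across temperatures:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label distillation loss over one output distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, p_student)
```

When the student matches the teacher exactly the loss is zero; raising the temperature spreads probability mass onto non-argmax classes, which is precisely the inter-class information the student learns from.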

In LLM contexts, distillation often takes the form of data distillation: generating synthetic training data from a teacher model and SFT-ing the student on it. This is simpler than output-distribution matching but effective in practice.
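A sketch of the data-distillation pipeline, assuming a hypothetical `teacher_generate` callable that wraps the teacher model’s sampling API. The light filtering (dropping empty and duplicate completions) mirrors the cleanup steps such pipelines typically apply before SFT:

```python
def distill_dataset(prompts, teacher_generate, min_words=1):
    """Build an SFT dataset by labeling prompts with teacher completions.

    `teacher_generate(prompt) -> str` is a stand-in for the real teacher API.
    The student is then fine-tuned on the returned (prompt, completion)
    pairs with the ordinary next-token cross-entropy loss.
    """
    dataset, seen = [], set()
    for prompt in prompts:
        completion = teacher_generate(prompt)
        # Drop empty or duplicate outputs before fine-tuning.
        if len(completion.split()) < min_words or completion in seen:
            continue
        seen.add(completion)
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

Because the student only ever sees teacher-sampled text (not the teacher’s full token distributions), this transfers surface behavior more cheaply than soft-label matching, at the cost of discarding the teacher’s per-token uncertainty.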

Key variants:

  • Soft label distillation — match teacher token probabilities directly
  • Data distillation — SFT on teacher-generated outputs
  • Speculative decoding — smaller draft model approximates large model token-by-token (an inference-time analog)
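The speculative-decoding variant can be sketched as a toy acceptance loop (all four callables are hypothetical stand-ins for real model APIs, and rejection resampling is elided). Each draft token is accepted with probability min(1, p_target/p_draft), which leaves the target model’s output distribution unchanged:

```python
import random

def speculative_step(draft_sample, draft_prob, target_prob, context, k=4, rng=random):
    """One round of speculative decoding: the draft model proposes up to k
    tokens; the target model verifies them (in a real system, in parallel).

    draft_sample(context) -> token
    draft_prob(context, token) -> float   # draft model's probability
    target_prob(context, token) -> float  # target model's probability
    """
    accepted = []
    for _ in range(k):
        token = draft_sample(context + accepted)
        p_draft = draft_prob(context + accepted, token)
        p_target = target_prob(context + accepted, token)
        # Accept with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target / p_draft):
            accepted.append(token)
        else:
            break  # on rejection, resample from the adjusted target distribution
    return accepted
```

When the draft closely approximates the target (the distillation connection), most proposals are accepted and the target model validates several tokens per forward pass instead of one.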

Key Sources

  • sft — Data distillation is implemented as SFT on teacher outputs
  • speculative-decoding — Uses a small draft model in an inference-time distillation-like role

Open Questions

  • What is the capability ceiling for distilled models relative to their teachers?
  • Can distillation transfer reasoning capabilities (not just surface behavior) from large to small models?
  • How does distillation interact with RLHF alignment — does it preserve or dilute alignment?