Concepts: distillation | temperature-scaling | compression | ensemble-methods Builds on: model ensembling (the problem this paper compresses away) and Caruana et al. (2006) neural network compression Leads to: DINO (self-distillation) | DINOv2 | QLoRA
One label throws away almost everything the teacher knows
You’ve trained a big, expensive ensemble of neural networks on ImageNet. For a photo of a BMW, the ensemble produces this output:
BMW: 0.920
sports car: 0.045
ambulance: 0.012
garbage truck: 0.000001
carrot: 0.000000001
Now you want to train a smaller, faster model for deployment. Standard training discards all of this. It replaces the teacher’s nuanced output with a hard label: [BMW=1, everything else=0]. The information that BMWs look a bit like sports cars, slightly like ambulances, nothing like garbage trucks and even less like carrots — gone.
“The relative probabilities of incorrect answers tell us a lot about how the net tends to generalize. An image of a BMW, for example, is given a probability of 10⁻⁶ of being a garbage truck and 10⁻⁹ of being a carrot.”
That ratio is structural knowledge about the visual world. Hard labels can never express it. Hinton et al.’s central question: what if we fed the student the full distribution?
The core idea
The analogy: Imagine studying for a medical licensing exam two ways. First method: read an answer key. Second method: have an experienced doctor explain not just the right answer, but why the wrong answers are plausible — “Option B is tempting because it resembles X, but the key difference is Y.” The second carries far more information per question. You learn how the concepts relate to each other, not just which answer wins.
A neural network’s soft output distribution is that doctor’s commentary. Hard labels are just the answer key.
The mechanism, step by step:
- Train a large teacher model (or ensemble) to convergence in the normal way.
- Take the teacher’s raw logits and apply a temperature-scaled softmax at temperature T > 1.
- Train the smaller student on these soft targets, minimizing a weighted combination:
- Cross-entropy with the teacher’s soft distribution (both evaluated at temperature T)
- Cross-entropy with the hard ground-truth labels (T=1)
- After training, discard the temperature. The student infers at T=1.
HARD LABELS vs. SOFT TARGETS for a cat image:
Hard label: cat=1.000 dog=0.000 truck=0.000 (all signal in one bit)
Teacher T=1: cat=0.813 dog=0.181 truck=0.005 (confident, some signal)
Teacher T=2: cat=0.643 dog=0.304 truck=0.053 (softer, more informative)
Teacher T=4: cat=0.507 dog=0.348 truck=0.145 (structure fully visible)
Student trains at T=4, infers at T=1:
→ "cats and dogs are close neighbors; trucks are far."
→ This relational structure is now encoded in the student's weights.
The math, translated:
Temperature-scaled softmax:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
At T=1: standard softmax, usually very confident. At T→∞: uniform distribution. At T=3–10: useful middle ground where runner-ups reveal the similarity structure.
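In code, using the teacher logits from the walkthrough below (a throwaway sketch that reproduces the numbers in the comparison table above):

```python
import math

def softmax(logits, T):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T)"""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [3.0, 1.5, -2.0]  # cat, dog, truck (walkthrough values)
for T in (1, 2, 4):
    print(T, [round(q, 3) for q in softmax(teacher_logits, T)])
```

Raising T flattens the distribution without ever reordering the classes, which is why the similarity structure survives.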
The distillation loss:
L = α · T² · H(q_teacher^(T), q_student^(T)) + (1 − α) · H(y_hard, q_student^(T=1))
where H is cross-entropy, q^(T) denotes the softmax computed at temperature T, and α weights the two terms. The T² factor compensates for the fact that raising T shrinks the soft-target gradient magnitudes by roughly 1/T²; without it, different temperatures would behave like different learning rates.
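A minimal sketch of this objective in plain Python, assuming three classes; α=0.9 and T=4 here are illustrative values, not numbers from the paper:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.9):
    """alpha * T^2 * soft-target CE  +  (1 - alpha) * hard-label CE.

    The T**2 factor offsets the ~1/T^2 shrinkage of soft-target gradients.
    alpha and T are illustrative hyperparameters, tuned per task in practice.
    """
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits, T=1.0))
    return alpha * T**2 * soft + (1.0 - alpha) * hard

# Walkthrough numbers: confused student vs. confident teacher, true class "cat".
loss = distillation_loss([1.5, 1.0, 0.8], [3.0, 1.5, -2.0], hard_label=0)
```

In a training loop this scalar would be backpropagated through the student only; the teacher's logits are fixed targets.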
Walkthrough with actual numbers:
Teacher logits for a cat image: z = [3.0, 1.5, -2.0] (cat, dog, truck).
At T=1:
exp([3.0, 1.5, -2.0]) = [20.09, 4.48, 0.135] → q = [0.813, 0.181, 0.005]
"almost certainly cat, a bit dog-like"
At T=4:
exp([0.75, 0.375, -0.5]) = [2.117, 1.455, 0.607] → q = [0.507, 0.348, 0.145]
"cat, but dog is 35% and truck is 15% — the structure is legible"
Student soft output at T=4 (early training):
z_student = [1.5, 1.0, 0.8]
exp([0.375, 0.25, 0.2]) = [1.455, 1.284, 1.221] → q_s = [0.367, 0.324, 0.308]
"confused — all three classes look roughly equal"
Soft-target cross-entropy:
H = -[0.507·log(0.367) + 0.348·log(0.324) + 0.145·log(0.308)]
= -[-0.508 - 0.392 - 0.171] = 1.071
Gradient pushes student toward [0.507, 0.348, 0.145] — not just "more cat",
but also "relatively more dog than truck."
After many steps, the student internalizes the teacher's similarity structure.
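The walkthrough checks out in a few lines of Python (full-precision values differ from the hand-rounded 1.071 by less than 0.001):

```python
import math

def softmax(logits, T):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

T = 4.0
q_teacher = softmax([3.0, 1.5, -2.0], T)   # ~[0.507, 0.348, 0.145]
q_student = softmax([1.5, 1.0, 0.8], T)    # ~[0.367, 0.324, 0.308]

# Soft-target cross-entropy from the walkthrough.
H = -sum(p * math.log(q) for p, q in zip(q_teacher, q_student))

# dH/dz_i for the student is (q_student_i - q_teacher_i) / T: negative for
# cat and dog (those logits get pushed up), positive for truck (pushed down).
grad = [(qs - qt) / T for qs, qt in zip(q_student, q_teacher)]
```

The gradient's sign pattern is the "not just more cat, but relatively more dog than truck" signal in concrete form.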
What’s clever:
“Much of the information about the learned function resides in the ratios of very small probabilities in the soft targets.”
This is the non-obvious fact. The tiny probabilities — near-zero values for wrong classes — encode structural knowledge about the input space. The difference between 10⁻⁶ (garbage truck) and 10⁻⁹ (carrot) for a BMW image is real signal, not noise.
Hard labels have high variance per sample — you need many examples before signal averages out. Soft targets are information-dense:
“If the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the distilled model can often be trained on much less data than the original model.”
The instinct behind the paper: you’ve already paid to train a good teacher. Its outputs contain a compressed representation of everything it learned. Throwing that away and replacing it with a one-hot vector is discarding the product of billions of FLOPs.
Does it work? What breaks?
| Task | Baseline (small net) | Teacher/Ensemble | Distilled student |
|---|---|---|---|
| MNIST (test errors) | 146 errors | 67 errors (large net) | 74 errors |
| Speech acoustic model | 58.9% frame acc. | 61.1% (10-model ensemble) | 60.8% (≈ ensemble) |
| Fine-grained classes | generalist baseline | specialist ensemble | +4.4% on confusable |
The MNIST result: a small distilled network achieves 74 errors vs. the large teacher’s 67 — only 7 more, despite being substantially smaller and trained on soft targets rather than hard labels alone.
The acoustic model result is industrial: in a Google production speech system, a 10-model ensemble was compressed into a single model with essentially no accuracy loss.
What doesn’t work:
Distillation requires a trained teacher — you pay the training cost upfront. It’s an inference efficiency win, not a training efficiency win.
Temperature is a sensitive hyperparameter. Too low and soft targets collapse to hard labels. Too high and the distribution becomes nearly uniform — you lose the signal about which class is most probable. T=3–10 tends to work, but it requires tuning per task.
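The sensitivity is easy to see numerically. With the running-example teacher logits (illustrative only, not a tuning recommendation), the output entropy collapses toward one-hot at low T and approaches uniform at high T:

```python
import math

def softmax(logits, T):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Entropy of the teacher's softened output at several temperatures.
entropies = {}
for T in (0.5, 1, 4, 50):
    q = softmax([3.0, 1.5, -2.0], T)
    entropies[T] = -sum(p * math.log(p) for p in q)
    print(f"T={T:>4}: {[round(p, 3) for p in q]}  entropy={entropies[T]:.3f}")
```

At T=0.5 the soft targets carry little more than the hard label; at T=50 they are nearly uniform and the top class barely stands out.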
The paper’s specialist model approach (training specialist models on subsets of confusable classes, then combining) adds engineering complexity and never became a standard practice. Plain student-from-generalist-teacher distillation is what survived.
If you’re building ML systems
Distillation is the most reliable way to shrink a model without a large accuracy drop. The recipe: train a large model until it’s as good as possible, then use its soft outputs to train a 2–10× smaller model for deployment. This consistently beats training the small model from scratch on hard labels.
The technique extends naturally: distill from an ensemble, from a model trained on more data than you currently have, or across architectures. The student needs only compatible output spaces — not the same architecture as the teacher.
This paper is the direct ancestor of every compressed LLM deployed today. The open-source reasoning distillation models — trained on outputs from large reasoning teachers — apply exactly this idea at the token level. And DINO takes it further: the teacher is an exponential moving average of the student itself, enabling distillation without any labels at all.
Knowledge distillation is why powerful ML can run on consumer hardware. A frontier model pays the training cost once. Millions of users run the distilled student.
One sentence: A neural network’s probability outputs over wrong answers are compressed structural knowledge — train a small student on those soft targets, not hard labels, and it inherits what the teacher understood about the world.
Paper: Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean — 2015
Connections
- distillation — the technique introduced in this paper
- temperature-scaling — softmax temperature controls how much structural information the soft targets reveal
- compression — distillation is the dominant approach to compressing neural networks
- ensemble-methods — distillation converts ensemble accuracy into single-model efficiency
- DINO — extends to self-distillation with no labels
- DINOv2 — scales self-distillation with patch-level loss
- QLoRA — LLM compression lineage
Citation
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. NIPS 2014 Deep Learning Workshop. https://arxiv.org/abs/1503.02531