What It Is

Ensemble methods combine predictions from multiple independently trained models. The simplest form averages their output probabilities. Ensembles almost always outperform any single constituent model because their errors are only partially correlated: where one model is wrong, others are often right, so averaging cancels much of the individual error.
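A minimal sketch of probability averaging, assuming each model exposes a `(n_samples, n_classes)` array of class probabilities (names and toy numbers here are illustrative, not from any specific library):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class-probability arrays from several models.

    prob_list: list of (n_samples, n_classes) arrays, one per model.
    Returns the averaged probabilities and the argmax class per sample.
    """
    avg = np.mean(np.stack(prob_list), axis=0)
    return avg, avg.argmax(axis=1)

# Toy illustration: three models, two samples, two classes.
# Model 2 is wrong on sample 0; the other two outvote it.
p1 = np.array([[0.7, 0.3], [0.2, 0.8]])
p2 = np.array([[0.4, 0.6], [0.1, 0.9]])
p3 = np.array([[0.8, 0.2], [0.3, 0.7]])

avg, preds = ensemble_predict([p1, p2, p3])
# preds → [0, 1]: the averaged ensemble recovers the correct
# class on sample 0 despite one model's mistake.
```

Note the averaged output is still a valid probability distribution (rows sum to 1), which is what lets it serve directly as a soft target later.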

Why It Matters

Ensembles set the performance ceiling that distillation tries to match in a single model. In practice, deploying 10 models costs 10× the inference compute, so compressing ensemble knowledge into a single student model is the primary motivation for knowledge distillation.

How It Works

Train N models with different random seeds, data orderings, or hyperparameters. At inference, average their logits or their probabilities (the two schemes are similar but not identical). Averaging reduces the variance contributed by individual models without changing the bias they share. Specialist ensembles (models trained on subsets of confusable classes) can further improve fine-grained accuracy.
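The two averaging choices above can be made concrete. This sketch (function names are my own, not a standard API) averages either raw logits before a single softmax, or per-model probabilities after softmax; with asymmetric logits the results differ slightly:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def average_logits(logit_list):
    """Average raw logits across models, then softmax once."""
    return softmax(np.mean(np.stack(logit_list), axis=0))

def average_probs(logit_list):
    """Softmax each model's logits, then average the probabilities."""
    return np.mean(np.stack([softmax(z) for z in logit_list]), axis=0)

# Two models, one sample, two classes, deliberately asymmetric logits.
logits = [np.array([[3.0, 0.0]]), np.array([[0.0, 1.0]])]
pa = average_logits(logits)   # geometric-mean-like pooling
pb = average_probs(logits)    # arithmetic mean of distributions
```

Both outputs are valid distributions and usually rank classes the same way; probability averaging is the more common choice when models are trained separately, since it weights each model's confidence directly.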

Key Sources

  • distillation — knowledge distillation compresses ensemble knowledge into one model
  • transfer-learning — ensemble outputs can serve as rich transfer targets