What It Is

Temperature scaling divides a neural network’s raw logits by a scalar T before applying softmax. At T=1 you get the standard softmax. At T>1 the output distribution becomes softer (more spread out, higher entropy). At T<1 the distribution becomes sharper (more confident, closer to one-hot).
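The operation can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the logits are made-up example values.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max before exp() for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw network outputs
p_standard = softmax_with_temperature(logits, T=1.0)  # standard softmax
p_soft = softmax_with_temperature(logits, T=5.0)      # softer, higher entropy
p_sharp = softmax_with_temperature(logits, T=0.5)     # sharper, closer to one-hot
```

At T=5 the largest probability shrinks and the others grow; at T=0.5 the opposite happens, while every distribution still sums to 1.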

Why It Matters

Temperature is the key control knob in knowledge distillation: raising T reveals the structural similarity information buried in a teacher model’s near-zero probabilities. It’s also used for calibration (adjusting model confidence to match actual accuracy) and for controlling sampling diversity at inference time.

How It Works

Given logits z_i, temperature scaling computes p_i = exp(z_i / T) / Σ_j exp(z_j / T). As T → ∞, all p_i approach 1/N (uniform). As T → 0, all probability mass collapses onto the argmax. For distillation, T = 3–10 is typical: the dominant class still wins, but the runner-up probabilities reveal the similarity structure.
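Both limits can be checked directly. A small sketch, using invented logits and extreme (not practical) temperatures to approximate T → ∞ and T → 0:

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 1.0]  # hypothetical logits, N = 3 classes

near_uniform = softmax_t(logits, T=1000.0)  # every prob close to 1/3
near_onehot = softmax_t(logits, T=0.01)     # argmax prob close to 1.0
```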

Key Sources

  • distillation — temperature scaling enables soft targets in knowledge distillation
  • sampling — temperature controls diversity in LLM generation