What It Is
Temperature scaling divides a neural network’s raw logits by a scalar T before applying softmax. At T=1 you get the standard softmax. At T>1 the output distribution becomes softer (more spread out, higher entropy). At T<1 the distribution becomes sharper (more confident, closer to one-hot).
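The scaling described above can be sketched in a few lines of NumPy (the function name is illustrative, not a standard API):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # shift so the largest logit is 0 (avoids overflow)
    e = np.exp(z)
    return e / e.sum()

logits = [5.0, 2.0, 1.0]
print(softmax_with_temperature(logits, T=1.0))  # standard softmax, sharp
print(softmax_with_temperature(logits, T=5.0))  # softer, higher entropy
print(softmax_with_temperature(logits, T=0.5))  # sharper, closer to one-hot
```

Note that the division happens before the softmax, so the output always remains a valid probability distribution regardless of T.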
Why It Matters
Temperature is the key control knob in knowledge distillation: raising T reveals the structural similarity information buried in a teacher model’s near-zero probabilities. It’s also used for calibration (adjusting model confidence to match actual accuracy) and for controlling sampling diversity at inference time.
How It Works
With temperature, the softmax becomes p_i = exp(z_i / T) / Σ_j exp(z_j / T). As T → ∞, all probabilities approach 1/N (uniform); as T → 0⁺, all probability mass collapses onto the argmax. For distillation, T = 3–10 is typical: the dominant class still wins, but the runner-up probabilities reveal similarity structure.
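These limits, and the distillation sweet spot between them, can be seen directly by sweeping T over hypothetical teacher logits (the logit values below are made up for illustration):

```python
import numpy as np

def soft_targets(logits, T):
    """Temperature-scaled softmax: p_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = [8.0, 4.0, 3.0, -2.0]  # hypothetical teacher outputs

for T in (1, 5, 100):
    p = soft_targets(teacher_logits, T)
    print(f"T={T:>3}: {p.round(3)}  argmax={p.argmax()}")
# At T=1 the runner-ups are nearly invisible; at T=5 they carry visible
# mass (revealing which wrong classes the teacher considers similar);
# at T=100 the distribution is almost uniform. The argmax never changes.
```

This is why a moderate T works for distillation: it lifts the near-zero runner-up probabilities into a usable training signal without disturbing the ranking.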
Related Concepts
- distillation — temperature scaling enables soft targets in knowledge distillation
- sampling — temperature controls diversity in LLM generation