What It Is

Temperature scaling divides a neural network’s raw logits by a scalar T before applying softmax. At T=1 you get the standard softmax. At T>1 the output distribution becomes softer (more spread out, higher entropy). At T<1 the distribution becomes sharper (more confident, closer to one-hot).
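The operation can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the logits are made-up example values.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max before exp() for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw network outputs
p_standard = softmax_with_temperature(logits, T=1.0)  # standard softmax
p_soft = softmax_with_temperature(logits, T=5.0)      # softer, higher entropy
p_sharp = softmax_with_temperature(logits, T=0.5)     # sharper, closer to one-hot
```

At T=5 the largest probability shrinks and the others grow; at T=0.5 the opposite happens, while every distribution still sums to 1.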

Why It Matters

Temperature is the key control knob in knowledge distillation: raising T reveals the structural similarity information buried in a teacher model’s near-zero probabilities. It’s also used for calibration (adjusting model confidence to match actual accuracy) and for controlling sampling diversity at inference time.

How It Works

Given logits z_i, temperature scaling computes p_i = exp(z_i / T) / Σ_j exp(z_j / T). As T → ∞, all p_i approach 1/N (uniform). As T → 0, all probability mass collapses onto the argmax. For distillation, T = 3–10 is typical: the dominant class still wins, but the runner-up probabilities reveal the similarity structure.
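Both limits can be checked directly. A small sketch, using invented logits and extreme (not practical) temperatures to approximate T → ∞ and T → 0:

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 1.0]  # hypothetical logits, N = 3 classes

near_uniform = softmax_t(logits, T=1000.0)  # every prob close to 1/3
near_onehot = softmax_t(logits, T=0.01)     # argmax prob close to 1.0
```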

Key Sources

  • distillation — temperature scaling enables soft targets in knowledge distillation
  • sampling — temperature controls diversity in LLM generation