The Problem

A neural network outputs probabilities for each class. The probabilities sum to 1 and the argmax is right most of the time. But are the probabilities themselves trustworthy? When the model says “99% sure,” is it actually correct 99% of the time? For modern deep networks: typically not.

Miscalibrated models cause real harm in:

  • Medical / safety-critical settings. A triage system that reports “97% benign” when only 70% of such cases are actually benign gives false reassurance.
  • Selective prediction. If you reject low-confidence predictions, you need confidence scores to actually reflect probability of correctness.
  • Downstream consumers. Models feeding into ensembles or decision systems must produce calibrated probabilities, or every component that combines or thresholds those scores inherits the error.
  • Fraud / dedup / matching. Threshold-based decisions assume the score-to-probability mapping is reliable.

The Key Insight

Calibration can be checked and corrected post hoc, separately from accuracy. A perfectly calibrated model satisfies P(correct | confidence = p) = p for all p. If the model is overconfident (predicts 99% but is right 71% of the time), apply a transformation that reduces confidence. With temperature scaling, the transformation does not change which class wins, only the magnitudes; calibrators fitted per class (for example, isotonic regression) can occasionally change the argmax.

The transformation lives downstream of the trained model. You do not retrain. You fit a small calibrator on a held-out validation set and append it to the inference pipeline.

Mechanism in Plain English

  1. Train your classifier as usual. Record (logits, label) for a held-out validation set.
  2. Build a reliability diagram: bin predictions by confidence (0-10%, 10-20%, …), then plot the actual accuracy in each bin. The curve should follow the diagonal; where it falls below, the model is overconfident, and where it rises above, the model is underconfident.
  3. Compute Expected Calibration Error (ECE): the weighted average gap between confidence and accuracy across bins.
  4. Fit a calibrator (temperature scaling, Platt, isotonic). The calibrator takes raw logits or probabilities and produces calibrated ones.
  5. At inference time, run the model, then run the calibrator before reporting confidence.
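
Step 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name reliability_bins, the equal-width binning, and the toy data are all assumptions made for the example. Each returned row is one point on the reliability diagram.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns (lo, hi, count, avg_conf, accuracy) for every non-empty bin.
    """
    # Per bin: [number of predictions, sum of confidences, number correct]
    sums = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += 1
        sums[idx][1] += conf
        sums[idx][2] += int(ok)

    rows = []
    for idx, (count, conf_sum, n_correct) in enumerate(sums):
        if count:
            rows.append((idx / n_bins, (idx + 1) / n_bins,
                         count, conf_sum / count, n_correct / count))
    return rows


# Toy data, made up for illustration: top-class probability and hit/miss.
confidences = [0.95, 0.92, 0.97, 0.55, 0.58, 0.91]
correct = [True, False, True, True, False, False]

for lo, hi, count, avg_conf, acc in reliability_bins(confidences, correct):
    print(f"[{lo:.1f}, {hi:.1f})  n={count}  conf={avg_conf:.2f}  acc={acc:.2f}")
```

Plotting avg_conf against accuracy for each row gives the diagram; rows where accuracy is below avg_conf sit under the diagonal.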

ASCII Diagram

Reliability diagram (X = predicted confidence, Y = actual accuracy):

  1.0 |              /
      |          / |
      |       /   .|   Diagonal = perfectly calibrated
  0.5 |    /  .    |
      |  /  .      |
      |/.    miscalibrated (below diagonal = overconfident)
  0.0 |____________|
      0.0   0.5   1.0

After temperature scaling, the curve climbs toward the diagonal.

Math with Translation

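Expected Calibration Error:

  $$ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| $$

where n is the total number of predictions and: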

  • M — number of bins (typically 10 or 15)
  • B_m — set of predictions falling in bin m
  • |B_m| / n — fraction of total predictions in this bin
  • acc(B_m) — fraction of predictions in B_m that were correct
  • conf(B_m) — average confidence of predictions in B_m

A perfectly calibrated model has ECE = 0.
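
A minimal sketch of ECE built from the terms listed above, assuming equal-width bins and using the top-class probability as the confidence. The function name and the toy numbers are illustrative assumptions, not taken from any library.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Bin-size-weighted average of |acc(B_m) - conf(B_m)| over non-empty bins."""
    n = len(confidences)
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)

    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins contribute nothing
        acc = hit / count
        avg_conf = conf_sum / count
        ece += (count / n) * abs(acc - avg_conf)
    return ece


# Toy data: four predictions at 0.9 (half correct), two at 0.6 (half correct).
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, False, False, True, False]

ece = expected_calibration_error(confidences, correct, n_bins=10)
print(f"ECE = {ece:.3f}")  # (4/6) * 0.4 + (2/6) * 0.1 = 0.300
```

Note that the absolute value matters: without it, overconfident and underconfident bins would cancel each other out.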

Temperature scaling, the simplest fix: divide the logit vector z by a learned scalar T > 0 before softmax:

  $$ \hat{p} = \mathrm{softmax}(z / T) $$

T > 1 softens the distribution (less confident); T < 1 sharpens. Learn T by minimizing NLL on validation.
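
The fit can be sketched without any ML framework. The version below is one possible approach under stated assumptions: it uses a golden-section search over a fixed interval (a swapped-in optimizer chosen for brevity; gradient-based optimizers are equally valid), and the function names, search bounds, and toy logits are all hypothetical.

```python
import math


def nll(all_logits, labels, temperature):
    """Mean negative log-likelihood of the true labels for softmax(logits / T)."""
    total = 0.0
    for logits, label in zip(all_logits, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(all_logits, labels, lo=0.05, hi=10.0, iters=60):
    """Golden-section search for the T in [lo, hi] that minimizes validation NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(all_logits, labels, c) < nll(all_logits, labels, d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Toy validation set: every prediction has a logit margin of 4,
# but only four of the six predictions are correct (overconfident model).
val_logits = [[4, 0, 0], [4, 0, 0], [0, 4, 0], [0, 0, 4], [4, 0, 0], [0, 4, 0]]
val_labels = [0, 0, 1, 2, 1, 2]

T = fit_temperature(val_logits, val_labels)
print(f"fitted T = {T:.2f}")
print(f"NLL before = {nll(val_logits, val_labels, 1.0):.3f}, "
      f"after = {nll(val_logits, val_labels, T):.3f}")
```

On this toy set the fitted T is greater than 1, as expected for an overconfident model: the optimum is where the top-class probability equals the 4/6 accuracy, which works out to T = 4 / ln 4.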

Concrete Walkthrough

ResNet-110 on CIFAR-100, validation set:

Bin (confidence range)  Avg conf  Accuracy   Gap
[0.93, 1.00]            0.99      0.71       +0.28 (overconfident)
[0.86, 0.93]            0.89      0.67       +0.22
[0.80, 0.86]            0.83      0.62       +0.21
...
[0.00, 0.20]            0.10      0.18       -0.08 (slightly underconfident)

ECE = sum(bin_size * |gap|) / total ≈ 16.5%

Fit T by minimizing NLL on validation:
  NLL(T) = -sum log(softmax(logits / T)[label])
Found T ≈ 2.07.

After T-scaling on test set:
  Bin             Avg conf  Accuracy   Gap
  [0.93, 1.00]    0.95      0.93       +0.02
  ...
  ECE ≈ 1.3%

Test accuracy: unchanged (T-scaling preserves argmax).
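
A quick check of the argmax claim, as a sketch with made-up logits: dividing by any positive T preserves the ordering of the logits, so only the confidence moves. The helper names and the example logit vector are assumptions for illustration; T = 2.07 is the value from the walkthrough.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 5.5, 1.0, 4.0]  # hypothetical logits for one example
temperatures = (0.5, 1.0, 2.07, 10.0)

for t in temperatures:
    probs = softmax(logits, t)
    print(f"T={t:<5} argmax={argmax(probs)} confidence={max(probs):.3f}")
```

The predicted class stays the same for every T, while the reported confidence falls as T grows.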

What’s Clever

Calibration can be improved without retraining and without accuracy loss. You separate two concerns: which class wins (accuracy) and how confident to be (calibration). Architectures, objectives, optimizers all influence both, but post-hoc calibrators can fix the second cheaply.

The diagnostic (reliability diagram + ECE) reveals the gap that everyone had been ignoring. Once visible, the fix is obvious: rescale.

Open Questions

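  • Distribution shift. Calibration is sensitive to the gap between validation and test distributions: a temperature fitted on in-distribution validation data may no longer be appropriate once the inputs shift. Methods that stay reliable under shift (e.g., variants of conformal prediction) are an open area.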
  • Multi-task / multi-modal. A single temperature may not work when the model has heterogeneous heads. Per-task or per-class calibration is more robust.
  • Calibration-aware training. Training-time choices such as focal loss or label smoothing can partially substitute for post-hoc calibration, but they change the training objective and may trade off accuracy.