The Problem
A neural network outputs probabilities for each class. The probabilities sum to 1 and the argmax is right most of the time. But are the probabilities themselves trustworthy? When the model says “99% sure,” is it actually correct 99% of the time? For modern deep networks: typically not.
Miscalibrated models cause real harm in:
- Medical / safety-critical settings. A triage system that reports “97% benign” when only 70% of such cases are actually benign gives false reassurance.
- Selective prediction. If you reject low-confidence predictions, you need confidence scores to actually reflect probability of correctness.
- Downstream consumers. Models feeding into ensembles or decision systems must produce calibrated probabilities; otherwise the downstream system weights or thresholds their outputs incorrectly.
- Fraud / dedup / matching. Threshold-based decisions assume the score-to-probability mapping is reliable.
The Key Insight
Calibration can be checked and corrected post-hoc, separately from accuracy. A well-calibrated model satisfies: P(correct | confidence = p) = p for all p. If the model is overconfident (for example, it predicts 99% but is right 71% of the time), apply a transformation that reduces confidence. With temperature scaling, the transformation does not change which class wins, only the magnitudes.
The transformation lives downstream of the trained model. You do not retrain. You fit a small calibrator on a held-out validation set and apply it to the model's outputs at inference.
Mechanism in Plain English
- Train your classifier as usual. Record (logits, label) for a held-out validation set.
- Build a reliability diagram: bin predictions by confidence (0-10%, 10-20%, …) and plot the actual accuracy in each bin. The curve should follow the diagonal; if it falls below, the model is overconfident.
- Compute Expected Calibration Error (ECE): the weighted average gap between confidence and accuracy across bins.
- Fit a calibrator (temperature scaling, Platt, isotonic). The calibrator takes raw logits or probabilities and produces calibrated ones.
- At inference time, run the model, then run the calibrator before reporting confidence.
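The binning step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `reliability_bins`, the equal-width binning, and the toy data are all assumptions made for the example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    confidences: top-class probability per prediction, each in [0, 1]
    correct:     True/False per prediction (argmax matched the label)
    Returns one (lower, upper, count, avg_conf, accuracy) tuple per non-empty bin.
    """
    sums = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of conf, number correct
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 would index past the end, so clamp it
        # into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += 1
        sums[idx][1] += conf
        sums[idx][2] += int(ok)

    rows = []
    for idx, (count, conf_sum, n_correct) in enumerate(sums):
        if count == 0:
            continue  # empty bins contribute nothing to the diagram
        rows.append((idx / n_bins, (idx + 1) / n_bins, count,
                     conf_sum / count, n_correct / count))
    return rows


# Toy data (made up for illustration): an overconfident model.
confs = [0.95, 0.97, 0.99, 0.92, 0.55, 0.58]
hits = [True, False, True, False, True, False]
for lo, hi, count, avg_conf, acc in reliability_bins(confs, hits):
    print(f"[{lo:.1f}, {hi:.1f}) n={count} conf={avg_conf:.2f} acc={acc:.2f}")
```

Plotting `avg_conf` against `accuracy` for each row gives the reliability diagram; rows where accuracy sits below average confidence are the overconfident bins.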
ASCII Diagram
Reliability diagram (X = predicted confidence, Y = actual accuracy):

1.0 |            /
    |          / |
    |       /   .|    Diagonal = perfectly calibrated
0.5 |     /   .  |
    |  /   .     |
    |/.               miscalibrated (below diagonal = overconfident)
0.0 |____________|
    0.0   0.5   1.0
After temperature scaling, the curve climbs toward the diagonal.
Math with Translation
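Expected Calibration Error:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|$$
where n is the total number of predictions and: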
- M — number of bins (typically 10 or 15)
- B_m — set of predictions falling in bin m
- |B_m| / n — fraction of total predictions in this bin
- acc(B_m) — fraction of predictions in B_m that were correct
- conf(B_m) — average confidence of predictions in B_m
A perfectly calibrated model has ECE = 0.
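As a rough sketch, the ECE definition translates directly into code once each bin has been summarized. The function name `ece_from_bins` and the (count, avg_conf, accuracy) tuple layout are assumptions made for this example, and the bin values are toy numbers, not measurements.

```python
def ece_from_bins(bins):
    """Expected Calibration Error from per-bin summaries.

    bins: iterable of (count, avg_conf, accuracy) tuples, one per non-empty bin,
          i.e. |B_m|, conf(B_m), and acc(B_m) in the notation above.
    """
    bins = list(bins)
    n = sum(count for count, _, _ in bins)  # total number of predictions
    # Weighted average of the absolute confidence/accuracy gap.
    return sum((count / n) * abs(acc - avg_conf) for count, avg_conf, acc in bins)


# Toy bins (hypothetical): 4 predictions at ~0.95 confidence, 2 at ~0.65,
# each bin only 50% accurate.
toy_bins = [(4, 0.95, 0.5), (2, 0.65, 0.5)]
print(f"ECE = {ece_from_bins(toy_bins):.3f}")  # (4 * 0.45 + 2 * 0.15) / 6 = 0.350
```

The absolute value matters: without it, overconfident and underconfident bins would cancel and a badly calibrated model could score near zero.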
Temperature scaling, the simplest fix: divide logits by a learned scalar T before softmax.
$$\hat{p} = \mathrm{softmax}(\mathrm{logits} / T)$$
T > 1 softens the distribution (less confident); T < 1 sharpens. Learn T by minimizing NLL on validation.
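A minimal sketch of fitting T follows, using only the standard library. The helper names (`softmax`, `nll`, `fit_temperature`), the search bounds, and the toy logits are assumptions for illustration. The optimizer is a plain golden-section search, chosen here only because the problem is one-dimensional; any scalar optimizer that minimizes validation NLL would serve.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(temperature, logits_batch, labels):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        # log-sum-exp form avoids taking log of an underflowed probability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(logits_batch, labels, low=0.05, high=20.0, iters=80):
    """Golden-section search for the T in [low, high] that minimizes NLL.

    Assumes NLL is unimodal in T on the interval (it is convex in 1/T).
    """
    ratio = (math.sqrt(5) - 1) / 2
    a, b = low, high
    for _ in range(iters):
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if nll(c, logits_batch, labels) < nll(d, logits_batch, labels):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Toy validation set (hypothetical): the model always outputs logits [4, 0],
# about 98% confidence in class 0, but class 0 is right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(f"T = {T:.2f}, calibrated confidence = {softmax([4.0, 0.0], T)[0]:.2f}")
```

On this toy set the optimum can be checked by hand: NLL is minimized when the class-0 probability equals the observed 75% accuracy, which gives T = 4 / ln 3 ≈ 3.64. If every validation prediction is correct, the minimum sits at the lower search bound, so the bounds are worth choosing with some care.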
Concrete Walkthrough
ResNet-110 on CIFAR-100, validation set:
Bin (confidence range)   Avg conf   Accuracy   Gap
[0.93, 1.00]             0.99       0.71       +0.28 (overconfident)
[0.86, 0.93]             0.89       0.67       +0.22
[0.80, 0.86]             0.83       0.62       +0.21
...
[0.00, 0.20]             0.10       0.18       -0.08 (slightly underconfident)
ECE = sum(bin_size * |gap|) / total ≈ 16.5%
Fit T by minimizing NLL on validation:
NLL(T) = -sum log(softmax(logits / T)[label])
Found T ≈ 2.07.
After T-scaling on test set:
Bin                      Avg conf   Accuracy   Gap
[0.93, 1.00]             0.95       0.93       +0.02
...
ECE ≈ 1.3%
Test accuracy: unchanged (T-scaling preserves argmax).
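The reason the argmax survives can be written out in one line. The symbols z_i (logit of class i) and K (number of classes) are notation introduced here for the sketch; the argument assumes T > 0.

```latex
\[
\mathrm{softmax}(z/T)_i = \frac{\exp(z_i/T)}{\sum_{k=1}^{K} \exp(z_k/T)},
\qquad
z_i > z_j \;\Longleftrightarrow\; \exp(z_i/T) > \exp(z_j/T) \quad (T > 0),
\]
\[
\text{so}\quad \arg\max_i \, \mathrm{softmax}(z/T)_i = \arg\max_i \, z_i .
\]
```

The denominator is shared by every class, so only the ordering of the numerators matters, and dividing by a positive T never reorders them.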
What’s Clever
Calibration can be improved without retraining and without accuracy loss. You separate two concerns: which class wins (accuracy) and how confident to be (calibration). Architectures, objectives, optimizers all influence both, but post-hoc calibrators can fix the second cheaply.
The diagnostic (reliability diagram + ECE) makes visible a gap that accuracy metrics alone hide. Once visible, the fix is obvious: rescale.
Key Sources
- on-calibration-of-modern-neural-networks — Guo et al. 2017, the modern reference
Related Concepts
- temperature-scaling — the simplest and most reliable calibrator
- uncertainty-estimation — calibration is a pillar of trustworthy uncertainty
- isotonic-regression — non-parametric alternative for binary problems
- expected-calibration-error — the metric
Open Questions
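- Distribution shift. Calibration is sensitive to the gap between validation and test distributions: a calibrator fit on validation data can be badly off once the test distribution drifts. Methods that stay reliable under shift (e.g., variants of conformal prediction) are an open area.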
- Multi-task / multi-modal. A single temperature may not work when the model has heterogeneous heads. Per-task or per-class calibration is more robust.
- Calibration-aware training. Adding calibration-oriented losses or regularizers (e.g., focal loss, label smoothing) to training partially substitutes for post-hoc calibration but can trade off accuracy.