On Calibration of Modern Neural Networks

Concepts: calibration | temperature-scaling | uncertainty-estimation | isotonic-regression | expected-calibration-error

In 2005, Niculescu-Mizil and Caruana found that neural networks were naturally well-calibrated on binary classification: when a network said it was 80% sure, it was right roughly 80% of the time. By 2017, ResNets were everywhere, accuracy had jumped, and a strange thing had happened: networks trained deeper and wider were now overconfident. A ResNet-110 on CIFAR-100 saying it is 99% sure is right only about 71% of the time. Guo et al. document this regression and propose an embarrassingly simple fix: divide the logits by a single scalar T learned on a validation set, then take the softmax. That’s it.

The core idea

The analogy: Imagine a thermometer that reads correctly on average but always exaggerates temperature changes. When the room is mildly warm it says “scorching.” When mildly cool, “freezing.” The fix is not to redesign the thermometer; it is to multiply every reading by 0.7. The needle’s direction (which way is warmer) is unchanged, but the magnitudes get rescaled to match reality.

Modern neural network logits behave like that thermometer. The argmax of the softmax (the predicted class) has become more reliable as accuracy has grown. But the softmax of the raw logits gives probabilities that are systematically too peaked. A network that should say “75% cat, 15% dog, 10% other” instead says “99% cat, 0.5% dog, 0.5% other.” Same prediction; wrong confidence.

“Modern neural networks, unlike those from a decade ago, are poorly calibrated.”

The fix is temperature scaling: divide every logit by a single scalar T > 0 before softmax. For an overconfident network the learned T is greater than 1, which flattens the output distribution. The argmax does not change (dividing all logits by the same positive number preserves their order). Only the spread of the resulting probabilities changes. Tune T on a held-out validation set to minimize negative log likelihood (NLL). Typical learned T values are between 1.5 and 3 for image classifiers.
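To make the rescaling concrete, here is a minimal sketch in plain Python. The logits and the temperature 2.5 are made-up illustration values, not numbers from the paper.

```python
import math

def softmax(logits, T=1.0):
    """Softmax of logits divided by a temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]             # hypothetical raw logits (cat, dog, other)
raw = softmax(logits)                # T = 1: the uncalibrated output
softened = softmax(logits, T=2.5)    # T > 1: a flatter distribution

# The predicted class is identical; only the confidence shrinks.
assert raw.index(max(raw)) == softened.index(max(softened))
print(round(max(raw), 3), round(max(softened), 3))
```

With these inputs the top-class probability falls from roughly 0.98 to roughly 0.75 while the predicted class stays the same.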

What’s clever — find the instinct

The non-obvious move is the diagnostic that exposed the problem. Before this paper, “calibration” was discussed mostly in the medical / forecasting literature, not in deep learning. The authors revived the reliability diagram (a visualization dating to the 1980s forecasting literature) and paired it with the Expected Calibration Error (ECE) metric of Naeini et al. (2015), then applied both to ResNets and discovered the gap.

A reliability diagram bins predictions by confidence (0-10%, 10-20%, …, 90-100%) and plots, in each bin, the fraction of predictions that were actually correct. A perfectly calibrated model lies on the diagonal. The authors show that virtually every modern image classifier has a curve that falls below the diagonal in the top bins — overconfident on its high-probability predictions.
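The binning described above, and the ECE summary built on it, can be sketched in a few lines of plain Python. The function names and the toy data are assumptions for illustration, and the half-open bin convention (lower edge included) is a simplification that may differ from the paper at exact bin boundaries.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (count, mean_confidence, accuracy) tuple per non-empty bin.
    """
    sums = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hits
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += 1
        sums[idx][1] += conf
        sums[idx][2] += int(hit)
    return [(n, c / n, h / n) for n, c, h in sums if n > 0]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |confidence - accuracy| over the bins."""
    total = len(confidences)
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(n / total * abs(c - a) for n, c, a in bins)

# Hypothetical toy data: five predictions at 0.95 confidence, three correct.
conf = [0.95] * 5
hits = [True, True, True, False, False]
ece = expected_calibration_error(conf, hits)
print(reliability_bins(conf, hits), ece)
```

In the toy data all five predictions land in the top bin, where confidence is 0.95 and accuracy is 0.6, so the bin sits 0.35 below the diagonal and that gap is also the ECE.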

“We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration.”

The second clever move: they tested every reasonable post-hoc calibrator (Platt scaling, isotonic regression, histogram binning, matrix scaling, vector scaling) on every reasonable architecture (ResNet, DenseNet, Wide ResNet) on every reasonable dataset (CIFAR, ImageNet, document classification). Temperature scaling — the simplest method, with a single learnable parameter — won or tied on almost every combination.

“On most datasets, temperature scaling — a single-parameter variant of Platt Scaling — is surprisingly effective at calibrating predictions.”

Why does it work so well? Their hypothesis: the miscalibration of modern networks is roughly a single-axis distortion (overall confidence is too high), not a per-class or per-input distortion. A single parameter is enough.

Walkthrough: temperature scaling on a ResNet

Setup: ResNet-110 on CIFAR-100. After training, take 5,000
       held-out validation examples; record (logits, label) for each.

Step 1: Build reliability diagram on validation set.
        For each (logits, label):
            confidence = max(softmax(logits))
            correct = (argmax(logits) == label)
        Bin by confidence into 15 equal-width bins [0, 1/15, 2/15, ..., 14/15, 1].

Bin             Avg confidence   Avg accuracy   Gap
[0.93, 1.00]    0.99             0.71           +0.28  (overconfident)
[0.87, 0.93]    0.89             0.67           +0.22
[0.80, 0.87]    0.83             0.62           +0.21
...

ECE = sum over bins of (bin_size / N) * |confidence - accuracy|
    ≈ 0.16 (uncalibrated)

Step 2: Learn T by minimizing NLL on validation set.
        NLL(T) = - sum log( softmax(logits / T)[label] )
        Minimize over T in (0, 10].
        Learned T ≈ 2.07.

Step 3: Verify on test set.
        Apply softmax(logits / 2.07).
        New ECE ≈ 0.013.

Step 4: Sanity check.
        argmax(logits / T) == argmax(logits)  →  accuracy unchanged
        Only the confidence distribution shifts.

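ECE drops from about 0.16 to about 0.013, roughly a 12x reduction from one parameter. Compared to isotonic regression (which fits a piecewise-constant function) or histogram binning (which fits a per-bin lookup), temperature scaling has one parameter instead of one per step or bin, which makes it harder to overfit a small validation set. Because T = 1 recovers the original model, fitting T cannot increase validation NLL, though ECE on new data is not guaranteed to improve.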
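Step 2 of the walkthrough can be sketched without any ML framework. The version below uses a golden-section search, which is one reasonable choice because NLL is convex in 1/T and therefore has a single minimum in T; it is not necessarily the optimizer the authors used. The function names, search bounds, and toy validation set are assumptions for illustration.

```python
import math

def nll(logits_list, labels, T):
    """Mean negative log likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_list, labels, lo=0.05, hi=10.0, iters=60):
    """Golden-section search for the T in [lo, hi] with the lowest NLL."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if nll(logits_list, labels, c) < nll(logits_list, labels, d):
            b = d   # the minimum lies in [a, d]
        else:
            a = c   # the minimum lies in [c, b]
    return (a + b) / 2

# Hypothetical validation set: four confident predictions, one of them wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(T, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, T))
```

In this toy case the model is right 75% of the time, so the best T is the one that pulls the top-class probability down to 0.75, which works out to 4 / ln 3, about 3.64.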

Does it work? What breaks?

Architecture	Dataset	ECE before	ECE after (T-scaling)	Test accuracy change
ResNet-110	CIFAR-100	16.5%	1.3%	0.0%
ResNet-152 (SD)	CIFAR-100	12.7%	1.4%	0.0%
DenseNet-40	CIFAR-100	9.5%	0.5%	0.0%
ResNet-50	ImageNet	6.6%	1.9%	0.0%
TreeLSTM	SST	6.6%	0.6%	0.0%

In the rows above, temperature scaling reduces ECE by roughly 3x to 19x while leaving accuracy unchanged.

The paper also dissects what causes miscalibration:

Depth: Deeper networks are more miscalibrated (with width held fixed).
Width: Wider networks are more miscalibrated.
Weight decay: Less weight decay → worse calibration (the regularizer was acting as a calibrator!).
Batch normalization: Networks with BatchNorm are more miscalibrated than equivalent non-BN networks.

The likely mechanism: NLL training pushes the logit norms ever larger to drive the argmax confidence to 1.0, even after accuracy plateaus. Modern architectures (deeper, wider, with BN) are better at finding this regime.
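A small numeric illustration of that mechanism, under the assumption that the training set is already classified perfectly: multiplying every logit by a larger factor keeps lowering NLL while accuracy cannot move. The logits and scale factors below are made up for the demonstration.

```python
import math

def mean_nll(logits_list, labels):
    """Mean negative log likelihood under a plain softmax."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_norm - logits[y]
    return total / len(labels)

def accuracy(logits_list, labels):
    """Fraction of examples whose largest logit is at the true label."""
    hits = 0
    for logits, y in zip(logits_list, labels):
        pred = max(range(len(logits)), key=logits.__getitem__)
        hits += int(pred == y)
    return hits / len(labels)

# Hypothetical training logits, all three already classified correctly.
train_logits = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.1, 0.4, 1.5]]
train_labels = [0, 1, 2]

losses = []
for scale in (1, 2, 4, 8):
    scaled = [[scale * z for z in row] for row in train_logits]
    losses.append(mean_nll(scaled, train_labels))
    print(scale, accuracy(scaled, train_labels), round(losses[-1], 4))
```

Accuracy stays at 1.0 for every scale while the loss keeps shrinking, so an optimizer that only sees NLL has a standing incentive to inflate the logits.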

“We conjecture that the network is over-fitting to the NLL without overfitting to the 0/1 loss.”

What breaks:

Distribution shift. Temperature scaling assumes the test distribution matches the validation distribution. Under shift, a calibrated model can become miscalibrated again. (See follow-up work: Ovadia et al. 2019, “Can you trust your model’s uncertainty?”)
Per-class miscalibration. If miscalibration is class-asymmetric (e.g., the model is overconfident on majority classes and underconfident on minority classes), a single T cannot fix it. Vector scaling (a per-class scale and bias applied to the logits) helps but adds parameters.
Selective prediction. If you reject low-confidence inputs, the calibrator only sees the accepted ones, which can re-introduce bias.
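Per-input errors. The predicted class is preserved, but predictive entropy is not: T > 1 raises the entropy of every prediction by the same rule, regardless of the input. T-scaling fixes the average gap between confidence and accuracy; it does not fix inputs whose confidence is wrong in different directions.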

So what?

For a practitioner shipping a model that needs reliable confidence (medical triage, fraud, dedup, search ranking, abstention):

Always check ECE before believing your model’s confidence. Modern networks lie about probabilities by default. Even if accuracy is fine, the calibration is probably broken.
Add temperature scaling as the last layer. Train normally, then fit T on a held-out validation set. Five lines of code; near-zero risk; large gains.
For two-class / binary problems, also try Platt scaling. It is a logistic function fit (slope a, intercept b) on the logits; T-scaling is a special case (a=1/T, b=0). When the bias term matters (e.g., highly imbalanced classes), Platt can outperform T-scaling slightly.
For binary problems where the mapping from raw scores to probabilities is not sigmoid-shaped, isotonic regression is the better tool. It fits a non-parametric monotone function from raw scores to calibrated probabilities. More parameters than T-scaling, more flexibility, and a greater risk of overfitting on small validation sets. (For example: dedup-style tasks with structured input often prefer isotonic.)
Recalibrate after retraining. T or any other fitted calibrator depends on the model’s logit distribution. Retraining shifts the distribution; the calibrator must be refit.
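The claim that T-scaling is the a=1/T, b=0 case of Platt scaling can be checked numerically. The sketch below assumes the binary score fed to Platt scaling is the difference between the two class logits; the function names and numbers are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def platt(score, a, b):
    """Platt scaling: a logistic function of a binary score."""
    return sigmoid(a * score + b)

def temperature_binary(logit_pos, logit_neg, T):
    """Positive-class probability from a two-class softmax at temperature T."""
    e_pos = math.exp(logit_pos / T)
    e_neg = math.exp(logit_neg / T)
    return e_pos / (e_pos + e_neg)

z_pos, z_neg, T = 3.2, -0.4, 2.0   # hypothetical logits and temperature
score = z_pos - z_neg              # binary score = logit difference
p_temp = temperature_binary(z_pos, z_neg, T)
p_platt = platt(score, a=1.0 / T, b=0.0)

# With b = 0 and a = 1/T the two calibrators agree.
assert abs(p_temp - p_platt) < 1e-12

# A nonzero intercept shifts the decision point, which no single T can do.
p_shifted = platt(score, a=1.0 / T, b=-1.0)
print(p_temp, p_shifted)
```

The intercept is what Platt scaling adds: it can move the score at which the probability crosses 0.5, whereas temperature alone always leaves that crossing at a logit difference of zero.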

For Saikat’s POI dedup pipeline specifically: isotonic regression remains a defensible choice when the calibration mapping is plausibly monotone but not well described by a single scale factor. But adding a temperature-scaling baseline is a five-line experiment that often wins, and gives a clean parameter count for ablations. The paper’s reliability-diagram diagnostic is the right way to quantify the gap before deciding which calibrator to deploy.
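For comparison with the one-parameter baseline, isotonic regression for a binary task can be sketched with the pool-adjacent-violators algorithm. This is a simplified illustration: it assumes distinct scores (tied scores are not pooled), and the function names and toy data are hypothetical. A production pipeline would more likely use a library implementation.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (thresholds, values): values[i] is the calibrated probability
    for scores at or above thresholds[i] and below thresholds[i + 1].
    """
    blocks = []  # each block: [label_sum, count, lowest_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values

def isotonic_predict(score, thresholds, values):
    """Look up the step containing the score, clipping at both ends."""
    result = values[0]
    for t, v in zip(thresholds, values):
        if score >= t:
            result = v
    return result

# Hypothetical duplicate-match scores with 0/1 ground truth.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_labels = [0, 1, 0, 1, 1, 1]
th, vals = isotonic_fit(toy_scores, toy_labels)
print(list(zip(th, vals)), isotonic_predict(0.25, th, vals))
```

The out-of-order pair at scores 0.2 and 0.3 is pooled into one step with value 0.5, which shows both the flexibility of the method and why it uses many more parameters than a single T.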

Connections

calibration — this is the canonical reference for the field
temperature-scaling — the recommended default calibrator
uncertainty-estimation — calibration is the foundation of trustworthy uncertainty
isotonic-regression — the non-parametric alternative for binary tasks
expected-calibration-error — the metric used here to measure the calibration gap
knowledge-distillation-hinton — temperature was introduced for distillation; here it is repurposed for calibration

Citation

arXiv:1706.04599

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017. https://arxiv.org/abs/1706.04599
