The Problem

A classifier with 90% accuracy might be:

  • Correctly modest: predicts 90% confidence on average, right 90% of the time. Calibrated.
  • Overconfident: predicts 99% on average, right 90% of the time. Miscalibrated.
  • Underconfident: predicts 70% on average, right 90% of the time. Also miscalibrated.

Accuracy alone cannot tell these apart. You need a metric for the gap between predicted confidence and actual accuracy.
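As a small illustration, the three cases above can be reduced to a signed gap between mean confidence and accuracy. The numbers are the ones quoted in the list; the function name `confidence_gap` is a hypothetical helper, not part of any library.

```python
# Summary numbers from the three cases above: (mean confidence, accuracy).
cases = {
    "calibrated": (0.90, 0.90),
    "overconfident": (0.99, 0.90),
    "underconfident": (0.70, 0.90),
}


def confidence_gap(mean_confidence, accuracy):
    """Signed gap: positive means overconfident, negative means underconfident."""
    return mean_confidence - accuracy


# Accuracy is identical in all three cases; only the gap tells them apart.
gaps = {name: round(confidence_gap(c, a), 2) for name, (c, a) in cases.items()}
for name, gap in gaps.items():
    print(f"{name:>15}: gap = {gap:+.2f}")
```

A single global gap like this is only a first check: it can be zero while individual confidence ranges are badly off, which is why ECE looks at the gap bin by bin.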

The Key Insight

Bin predictions by confidence (0-10%, 10-20%, …). In each bin, compute the average confidence and the average accuracy. A perfectly calibrated model has these equal in every bin. ECE is the weighted average of the absolute gap, weighted by bin size.

Mechanism in Plain English

  1. Run the model on a held-out test set; record (predicted-confidence, correct-or-not) for each example.
  2. Partition the [0, 1] confidence range into M equal-width bins (typically M = 10 or 15).
  3. Drop each prediction into its corresponding bin.
  4. For each bin: compute average confidence (avg of the predicted probabilities) and average accuracy (fraction correct).
  5. ECE is the sum over bins of the absolute difference between the two, each weighted by the fraction of all predictions that fall in that bin.
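The five steps above can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function name `expected_calibration_error` is hypothetical, and the choice of half-open bins (lo, hi], with a confidence of exactly 0 going to the first bin, is an assumption that the steps do not specify.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted confidence per example, each in [0, 1].
    correct:     1 (or True) if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    if n == 0:
        raise ValueError("need at least one prediction")

    # Step 2: equal-width bin edges; only the interior edges are needed.
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]

    # Step 3: bin index per prediction. right=True gives (lo, hi] bins
    # (an assumption); confidence 0.0 falls into bin 0.
    bin_ids = np.digitize(confidences, inner_edges, right=True)

    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        count = in_bin.sum()
        if count == 0:
            continue  # empty bins contribute nothing
        # Step 4: average confidence and accuracy inside the bin.
        avg_conf = confidences[in_bin].mean()
        accuracy = correct[in_bin].mean()
        # Step 5: absolute gap weighted by the bin's share of predictions.
        ece += (count / n) * abs(avg_conf - accuracy)
    return float(ece)


# Two predictions at 0.95 (both right), two at 0.55 (one right).
example = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
print(f"{example:.3f}")
```

In the usage example each bin holds half the predictions and has a gap of 0.05, so the result is 0.05.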

Math with Translation

  • M — number of bins
  • B_m — predictions in bin m
  • |B_m| — count of predictions in bin m
  • n — total number of predictions
  • acc(B_m) — fraction of B_m’s predictions that were correct
  • conf(B_m) — average confidence of predictions in B_m
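
Put together, the symbols above give the standard definition of ECE. It is written here in LaTeX and uses only the quantities already defined in the list:

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{n}\,
  \bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|
```

Each term is one bin's absolute gap between accuracy and confidence, scaled by the share of predictions that landed in that bin.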

ECE is in [0, 1]. A perfectly calibrated model has ECE = 0. A model that guesses at random on a balanced binary problem while always reporting “100%” confidence has ECE = 0.5.
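
The 0.5 figure can be checked by hand. Every prediction has confidence 1.0, so all n predictions fall into the top bin B_M, which therefore has weight 1, accuracy 0.5, and average confidence 1.0:

```latex
\mathrm{ECE}
  \;=\; \frac{|B_M|}{n}\,\bigl|\,\mathrm{acc}(B_M) - \mathrm{conf}(B_M)\,\bigr|
  \;=\; 1 \cdot |0.5 - 1.0|
  \;=\; 0.5
```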

Concrete Walkthrough

Test set with 1000 predictions:

Bin (conf range)  Count   Avg conf   Accuracy   Gap     Weighted gap
[0.0, 0.1]        50      0.07       0.05       0.02    0.02 * 0.05 = 0.001
(0.1, 0.2]        80      0.16       0.18       0.02    0.02 * 0.08 = 0.0016
(0.2, 0.3]        70      0.25       0.30       0.05    0.05 * 0.07 = 0.0035
(0.3, 0.4]        60      0.36       0.42       0.06    0.06 * 0.06 = 0.0036
(0.4, 0.5]        70      0.46       0.51       0.05    0.05 * 0.07 = 0.0035
(0.5, 0.6]        80      0.55       0.59       0.04    0.04 * 0.08 = 0.0032
(0.6, 0.7]        90      0.65       0.62       0.03    0.03 * 0.09 = 0.0027
(0.7, 0.8]        100     0.75       0.71       0.04    0.04 * 0.10 = 0.0040
(0.8, 0.9]        150     0.86       0.78       0.08    0.08 * 0.15 = 0.0120
(0.9, 1.0]        250     0.97       0.83       0.14    0.14 * 0.25 = 0.0350

ECE = 0.001 + 0.0016 + 0.0035 + 0.0036 + 0.0035 + 0.0032 + 0.0027 + 0.0040 + 0.0120 + 0.0350
    = 0.0701
    ≈ 7.0%
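
The arithmetic above can be reproduced directly from the table. The sketch below hard-codes the table's per-bin counts, average confidences, and accuracies; the variable names are illustrative only.

```python
# (count, average confidence, accuracy) per bin, copied from the table above.
rows = [
    (50, 0.07, 0.05),
    (80, 0.16, 0.18),
    (70, 0.25, 0.30),
    (60, 0.36, 0.42),
    (70, 0.46, 0.51),
    (80, 0.55, 0.59),
    (90, 0.65, 0.62),
    (100, 0.75, 0.71),
    (150, 0.86, 0.78),
    (250, 0.97, 0.83),
]

n = sum(count for count, _, _ in rows)

# Weighted absolute gap per bin: (|B_m| / n) * |acc - conf|.
contributions = [count / n * abs(conf - acc) for count, conf, acc in rows]
ece = sum(contributions)

# Share of the total ECE that comes from the two highest-confidence bins.
top_two_share = sum(contributions[-2:]) / ece

print(f"n = {n}, ECE = {ece:.4f}")          # n = 1000, ECE = 0.0701
print(f"top two bins: {top_two_share:.0%}")  # top two bins: 67%
```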

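Most of the miscalibration sits in the two highest-confidence bins, which together contribute 0.047 of the 0.0701 total (about two thirds). In both, average confidence exceeds accuracy, so the model is overconfident exactly where it claims to be most sure. This is typical of modern deep networks.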

What’s Clever

ECE captures the average gap weighted by where predictions actually fall. An empty bin contributes nothing. A bin that holds 60% of the predictions gets 60% of the weight.

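The metric has a direct reading: averaged over predictions, the model’s stated confidence differs from its observed accuracy by ECE. An ECE of 7% means the two are, on average, 7 percentage points apart. It is an average, not a bound, so individual bins can be much further off.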

Limitations

  • Bin choice matters. Too few bins (M=5) hide miscalibration; too many (M=50) leave bins too sparse for stable estimates. M=10 or 15 is standard.
  • Equal-width bins. Classifiers often have most predictions clumped near 0.99. Equal-width binning then lumps most of the data into the top bin, where finer resolution matters most, and leaves the remaining bins sparse; equal-mass binning (each bin has the same number of predictions) is sometimes used instead.
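  • Averages can hide systematic errors. ECE averages absolute gaps, so it cannot distinguish a model that is overconfident by 10 points in every bin from one that is overconfident by 10 points in some bins and underconfident by 10 points in others. Both have ECE = 10%, even though the second model’s mean confidence may match its overall accuracy. Reliability diagrams visualize the full pattern.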
  • Class-conditional ECE. A single ECE averages across all classes. For imbalanced or asymmetric tasks, per-class ECE reveals more.

Open Questions

  • Adaptive binning. Equal-mass binning (KS test style) gives more stable estimates at high confidence; not yet standard.
  • Multi-class metrics beyond top-1. ECE typically considers only the top predicted class. “Class-wise” or “full-distribution” calibration metrics generalize this but are less interpretable.
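
One common formulation of a class-wise metric treats each class as its own one-vs-rest problem: bin the probability assigned to class k across all examples, compare the average probability in each bin with the fraction of examples in that bin whose true label is k, and average the resulting per-class values. The sketch below follows that formulation under stated assumptions; the function name `classwise_ece`, the (lo, hi] bin convention, and the unweighted mean over classes are choices made here for illustration, and variants in the literature differ on these details.

```python
import numpy as np


def classwise_ece(probs, labels, n_bins=10):
    """Per-class ECE over the full predicted distribution.

    probs:  array of shape (n, K), each row a predicted distribution.
    labels: array of shape (n,), true class indices in 0..K-1.
    Returns (mean over classes, per-class values).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n, num_classes = probs.shape
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]

    per_class = np.zeros(num_classes)
    for k in range(num_classes):
        p_k = probs[:, k]                     # probability given to class k
        is_k = (labels == k).astype(float)    # 1 where the true label is k
        bin_ids = np.digitize(p_k, inner_edges, right=True)
        for m in range(n_bins):
            in_bin = bin_ids == m
            if in_bin.any():
                gap = abs(p_k[in_bin].mean() - is_k[in_bin].mean())
                per_class[k] += in_bin.sum() / n * gap
    return float(per_class.mean()), per_class


# A model that always puts all mass on class 0 while labels are split evenly.
mean_ece, per_class_ece = classwise_ece(
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]], [0, 1, 0, 1]
)
print(mean_ece, per_class_ece)
```

Returning the per-class values alongside the mean keeps the breakdown available, which matters because a single averaged number is harder to interpret than top-1 ECE.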