The Problem
A classifier with 90% accuracy might be:
- Correctly modest: predicts 90% confidence on average, right 90% of the time. Calibrated, at least on average.
- Overconfident: predicts 99% on average, right 90% of the time. Miscalibrated.
- Underconfident: predicts 70% on average, right 90% of the time. Also miscalibrated.
Accuracy alone cannot tell these apart. You need a metric for the gap between predicted confidence and actual accuracy.
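The three cases above can be sketched in a few lines. This is a minimal illustration, assuming each hypothetical model reports the same confidence on every prediction; the function name `confidence_gap` and the ten-example outcome list are invented for the example.

```python
def confidence_gap(confidences, correct):
    """Return (mean confidence, accuracy, mean confidence - accuracy)."""
    n = len(confidences)
    mean_conf = sum(confidences) / n
    accuracy = sum(correct) / n
    return mean_conf, accuracy, mean_conf - accuracy


# Hypothetical outcomes: 9 of 10 predictions correct (90% accuracy).
correct = [1] * 9 + [0]

# Same accuracy, three different levels of stated confidence.
for name, conf in [("modest", 0.90), ("overconfident", 0.99), ("underconfident", 0.70)]:
    mean_conf, acc, gap = confidence_gap([conf] * 10, correct)
    print(f"{name:>14}: conf={mean_conf:.2f} acc={acc:.2f} gap={gap:+.2f}")
```

All three rows report the same accuracy; only the gap column distinguishes them.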
The Key Insight
Bin predictions by confidence (0-10%, 10-20%, …). In each bin, compute the average confidence and the accuracy (the fraction of predictions that were correct). A perfectly calibrated model has the two equal in every bin. ECE is the average of the absolute gap between them, with each bin weighted by the share of predictions it contains.
Mechanism in Plain English
- Run the model on a held-out test set; record (predicted confidence, correct or not) for each example.
- Partition the [0, 1] confidence range into M equal-width bins (typically M = 10 or 15).
- Assign each prediction to the bin that contains its confidence.
- For each bin, compute the average confidence (mean of the predicted probabilities) and the accuracy (fraction correct).
- ECE is the sum over bins of the absolute difference between the two, each weighted by the bin's share of all predictions.
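The steps above can be sketched in plain Python. This is a minimal version under stated assumptions, not a reference implementation: the function name `expected_calibration_error` is invented here, bins are treated as half-open intervals [lo, hi) with a confidence of exactly 1.0 placed in the last bin, and the sample data is made up.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1] paired with 0/1 outcomes."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin running totals: count, summed confidence, number correct.
    counts = [0] * num_bins
    conf_sums = [0.0] * num_bins
    hit_sums = [0] * num_bins

    for p, hit in zip(confidences, correct):
        # Assumed convention: bins are [lo, hi); p == 1.0 goes in the last bin.
        m = min(int(p * num_bins), num_bins - 1)
        counts[m] += 1
        conf_sums[m] += p
        hit_sums[m] += hit

    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum / count
        accuracy = hits / count
        ece += (count / n) * abs(accuracy - avg_conf)
    return ece


# Made-up data: four predictions at 0.95 (three correct), two at 0.55 (one correct).
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [1, 1, 1, 0, 1, 0]
result = expected_calibration_error(confs, hits)
print(f"ECE = {result:.4f}")
```

In the sample data the 0.95 bin has a gap of 0.20 with weight 4/6 and the 0.55 bin has a gap of 0.05 with weight 2/6, so the result is about 0.15.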
Math with Translation
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|$$
where:
- M — number of bins
- B_m — predictions in bin m
- |B_m| — count of predictions in bin m
- n — total number of predictions
- acc(B_m) — fraction of B_m’s predictions that were correct
- conf(B_m) — average confidence of predictions in B_m
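For reference, the two per-bin quantities can be written out explicitly. The symbols $\hat{p}_i$ (the confidence of prediction $i$), $\hat{y}_i$ (its predicted label), and $y_i$ (the true label) are notation assumed here for illustration; they do not appear in the definitions above.

```latex
\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}[\hat{y}_i = y_i],
\qquad
\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i
```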
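ECE is in [0, 1]. A perfectly calibrated model has ECE = 0. A model that guesses at random on a balanced binary problem while always reporting 100% confidence has ECE = 0.5: every prediction falls in the top bin, where confidence is 1.0 and accuracy is 0.5.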
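The random-guessing case can be checked with a short simulation. This sketch assumes a model that always predicts class 1 with confidence 1.0 on fair coin-flip labels; the seed and sample size are arbitrary choices.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
n = 10_000

# Labels are fair coin flips; the hypothetical model always predicts class 1.
labels = [random.randint(0, 1) for _ in range(n)]
correct = [int(y == 1) for y in labels]

# Every prediction has confidence 1.0, so all of them land in the top bin
# and ECE reduces to |accuracy - 1.0| for that single bin.
accuracy = sum(correct) / n
ece = abs(accuracy - 1.0)
print(f"accuracy={accuracy:.3f}  ECE={ece:.3f}")
```

With a finite sample the result is close to, rather than exactly, 0.5.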
Concrete Walkthrough
Test set with 1000 predictions:
| Bin (conf range) | Count | Avg conf | Accuracy | Gap (absolute) | Weighted gap (gap * count/n) |
|---|---|---|---|---|---|
| [0.0, 0.1] | 50 | 0.07 | 0.05 | 0.02 | 0.02 * 0.05 = 0.0010 |
| [0.1, 0.2] | 80 | 0.16 | 0.18 | 0.02 | 0.02 * 0.08 = 0.0016 |
| [0.2, 0.3] | 70 | 0.25 | 0.30 | 0.05 | 0.05 * 0.07 = 0.0035 |
| [0.3, 0.4] | 60 | 0.36 | 0.42 | 0.06 | 0.06 * 0.06 = 0.0036 |
| [0.4, 0.5] | 70 | 0.46 | 0.51 | 0.05 | 0.05 * 0.07 = 0.0035 |
| [0.5, 0.6] | 80 | 0.55 | 0.59 | 0.04 | 0.04 * 0.08 = 0.0032 |
| [0.6, 0.7] | 90 | 0.65 | 0.62 | 0.03 | 0.03 * 0.09 = 0.0027 |
| [0.7, 0.8] | 100 | 0.75 | 0.71 | 0.04 | 0.04 * 0.10 = 0.0040 |
| [0.8, 0.9] | 150 | 0.86 | 0.78 | 0.08 | 0.08 * 0.15 = 0.0120 |
| [0.9, 1.0] | 250 | 0.97 | 0.83 | 0.14 | 0.14 * 0.25 = 0.0350 |
ECE = 0.001 + 0.0016 + 0.0035 + 0.0036 + 0.0035 + 0.0032 + 0.0027 + 0.0040 + 0.0120 + 0.0350
= 0.0701
≈ 7.0%
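Most of the miscalibration sits in the high-confidence bins: the last two contribute 0.047 of the 0.0701 total, about two-thirds, and in both of them confidence exceeds accuracy. This overconfidence at the high end is typical of modern deep networks.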
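The arithmetic in the walkthrough can be reproduced directly from the per-bin statistics. This sketch copies the counts, average confidences, and accuracies from the table; the variable names are illustrative.

```python
# (count, average confidence, accuracy) per bin, copied from the walkthrough table.
bins = [
    (50, 0.07, 0.05),
    (80, 0.16, 0.18),
    (70, 0.25, 0.30),
    (60, 0.36, 0.42),
    (70, 0.46, 0.51),
    (80, 0.55, 0.59),
    (90, 0.65, 0.62),
    (100, 0.75, 0.71),
    (150, 0.86, 0.78),
    (250, 0.97, 0.83),
]

n = sum(count for count, _, _ in bins)

# Each bin contributes its absolute gap scaled by its share of predictions.
contributions = [count / n * abs(acc - conf) for count, conf, acc in bins]
ece = sum(contributions)

# Fraction of the total that comes from the two highest-confidence bins.
top_two_share = sum(contributions[-2:]) / ece

print(f"n = {n}")                                   # n = 1000
print(f"ECE = {ece:.4f}")                           # ECE = 0.0701
print(f"top-two-bin share = {top_two_share:.0%}")   # top-two-bin share = 67%
```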
What’s Clever
ECE captures the average gap weighted by where predictions actually fall. A bin no one falls into contributes nothing. A bin where 60% of predictions land contributes proportionally.
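The metric has a direct interpretation: it is the average absolute difference between stated confidence and observed accuracy. Roughly, “when the model says X% confidence, its accuracy differs from X% by ECE on average.” It is an average, not a bound, so individual bins can be off by more.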
Limitations
- Bin choice matters. Too few bins (M=5) hide miscalibration; too many (M=50) leave bins too sparse for stable estimates. M=10 or 15 is standard.
- Equal-width bins. Classifiers often have most predictions clumped near 0.99. Equal-width binning then lumps most of the data into one or two bins, losing resolution where it matters, and leaves the remaining bins sparse and noisy; equal-mass binning (each bin has the same number of predictions) is sometimes used instead.
- ECE discards the direction of the error. A model that is overconfident by 10% in some bins and underconfident by 10% in others has ECE = 10%, the same as a model that is overconfident by 10% everywhere, even though its overall average confidence may match its overall accuracy. Reliability diagrams visualize the full pattern.
- Class-conditional ECE. A single ECE averages across all classes. For imbalanced or asymmetric tasks, per-class ECE reveals more.
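The point about gaps of opposite sign can be illustrated with two invented models. This sketch assumes per-bin statistics are already available; the `summarize` helper and both sets of numbers are hypothetical.

```python
def summarize(bins):
    """bins: list of (count, avg_conf, accuracy). Returns (ece, signed_gap)."""
    n = sum(count for count, _, _ in bins)
    ece = sum(count / n * abs(conf - acc) for count, conf, acc in bins)
    # Signed version: positive means overconfident overall.
    signed_gap = sum(count / n * (conf - acc) for count, conf, acc in bins)
    return ece, signed_gap


# Hypothetical per-bin statistics, invented for illustration.
mixed = [(500, 0.60, 0.70), (500, 0.90, 0.80)]          # under-, then overconfident
overconfident = [(500, 0.60, 0.50), (500, 0.90, 0.80)]  # overconfident in both bins

for name, bins in [("mixed", mixed), ("overconfident", overconfident)]:
    ece, signed_gap = summarize(bins)
    print(f"{name:>13}: ECE={ece:.2f} signed gap={signed_gap:+.2f}")
```

Both models have an ECE of about 0.10, but only the signed gap (or a reliability diagram) shows that the first one is balanced overall while the second is overconfident throughout.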
Key Sources
- on-calibration-of-modern-neural-networks — Guo et al. popularized ECE in the deep-learning literature.
Related Concepts
- calibration — the parent topic
- temperature-scaling — typical fix for high ECE
- uncertainty-estimation — broader category
Open Questions
- Adaptive binning. Equal-mass binning (KS test style) gives more stable estimates at high confidence; not yet standard.
- Multi-class metrics beyond top-1. ECE typically considers only the top predicted class. “Class-wise” or “full-distribution” calibration metrics generalize this but are less interpretable.
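Equal-mass binning, mentioned above, can be sketched as follows. This is one possible formulation under stated assumptions, not a standard implementation: the function name `equal_mass_ece` is invented, bins are contiguous slices of the confidence-sorted predictions, and ties in confidence are split in input order.

```python
def equal_mass_ece(confidences, correct, num_bins=10):
    """ECE with bins that each hold (nearly) the same number of predictions."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Sort by confidence only (stable sort), so tied confidences are not
    # systematically ordered by whether the prediction was correct.
    pairs = sorted(zip(confidences, correct), key=lambda pair: pair[0])

    ece = 0.0
    for m in range(num_bins):
        # Integer boundaries split the sorted list into near-equal slices.
        lo = m * n // num_bins
        hi = (m + 1) * n // num_bins
        chunk = pairs[lo:hi]
        if not chunk:
            continue  # happens when there are fewer predictions than bins
        avg_conf = sum(p for p, _ in chunk) / len(chunk)
        accuracy = sum(hit for _, hit in chunk) / len(chunk)
        ece += len(chunk) / n * abs(accuracy - avg_conf)
    return ece


# Made-up data: two bins of two predictions each.
example = equal_mass_ece([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1], num_bins=2)
print(f"equal-mass ECE = {example:.3f}")
```

In the example the lower bin has average confidence 0.15 and accuracy 0, the upper bin has average confidence 0.35 and accuracy 1, so the result is about 0.4. Because bin edges follow the data, the estimate depends on the sample; that is part of why results under this scheme are not directly comparable to equal-width ECE.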