Expected Calibration Error (ECE)

The Problem

A classifier with 90% accuracy might be:

Correctly modest: predicts 90% confidence on average, right 90% of the time. Calibrated.
Overconfident: predicts 99% on average, right 90% of the time. Miscalibrated.
Underconfident: predicts 70% on average, right 90% of the time. Also miscalibrated.

Accuracy alone cannot tell these apart. You need a metric for the gap between predicted confidence and actual accuracy.
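The three cases can be put side by side in a few lines. This is only an illustration: the numbers are the hypothetical ones from the list above, and the gap computed here is the difference between mean confidence and overall accuracy, a coarser quantity than ECE.

```python
# Hypothetical numbers mirroring the three cases above: identical accuracy,
# different average confidence.
accuracy = 0.90
mean_confidence = {
    "correctly modest": 0.90,
    "overconfident": 0.99,
    "underconfident": 0.70,
}

# Signed gap: positive means the model claims more than it delivers.
gaps = {name: conf - accuracy for name, conf in mean_confidence.items()}

for name, gap in gaps.items():
    print(f"{name:>17}: accuracy {accuracy:.2f}, confidence gap {gap:+.2f}")
```

All three rows report the same accuracy; only the gap column separates them.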

The Key Insight

Bin predictions by confidence (0-10%, 10-20%, …). In each bin, compute the average confidence and the average accuracy. A perfectly calibrated model has these equal in every bin. ECE is the average absolute gap between the two, with each bin weighted by the fraction of predictions that fall into it.

Mechanism in Plain English

Run the model on a held-out test set; record (predicted confidence, correct or not) for each example.
Partition the [0, 1] confidence range into M equal-width bins (typically M = 10 or 15).
Drop each prediction into its corresponding bin.
For each bin: compute the average confidence (mean of the predicted probabilities) and the average accuracy (fraction correct).
ECE is the sum over bins of the absolute difference between the two, each weighted by the bin's share of all predictions.
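The steps above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the function name `expected_calibration_error`, its parameters, and the left-closed bin convention (a confidence of exactly 1.0 goes into the last bin) are assumptions made for the example. It uses only the standard library to stay dependency-free.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins over [0, 1] (illustrative sketch).

    confidences: predicted probability of the chosen class, one per example.
    correct: 1 (or True) where the prediction was right, else 0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * num_bins  # sum of confidences per bin
    hits = [0] * num_bins        # number of correct predictions per bin
    count = [0] * num_bins       # number of predictions per bin

    for p, hit in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p} is outside [0, 1]")
        # Assumed convention: bin m covers [m/M, (m+1)/M); 1.0 joins the last bin.
        m = min(int(p * num_bins), num_bins - 1)
        conf_sum[m] += p
        hits[m] += int(hit)
        count[m] += 1

    ece = 0.0
    for m in range(num_bins):
        if count[m] == 0:
            continue  # empty bins contribute nothing
        avg_conf = conf_sum[m] / count[m]
        accuracy = hits[m] / count[m]
        ece += (count[m] / n) * abs(accuracy - avg_conf)
    return ece


# Six made-up predictions: four at 0.95 (three correct), two at 0.65 (one correct).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.4f}")  # ECE = 0.1833
```

In the toy data the top bin has a gap of 0.20 with weight 4/6 and the 0.6-0.7 bin a gap of 0.15 with weight 2/6, which gives 0.1333 + 0.0500.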

Math with Translation

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$

M — number of bins
B_m — predictions in bin m
|B_m| — count of predictions in bin m
n — total number of predictions
acc(B_m) — fraction of B_m’s predictions that were correct
conf(B_m) — average confidence of predictions in B_m

ECE is in [0, 1]. A perfectly calibrated model has ECE = 0. Random guessing on a 50/50 problem with the model always saying “100%” gives ECE = 0.5.
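To check the last figure: a model that always reports 100% puts all n predictions into the top bin, where the average confidence is 1.0 and, on a 50/50 problem answered by guessing, the accuracy is about 0.5. Every other bin is empty, so the sum collapses to a single term:

$\mathrm{ECE} = \frac{n}{n} \left| 0.5 - 1.0 \right| = 0.5$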

Concrete Walkthrough

Test set with 1000 predictions:

Bin (conf range)  Count   Avg conf   Accuracy   Gap     Weighted gap
[0.0, 0.1]        50      0.07       0.05       0.02    0.02 * 0.05 = 0.001
[0.1, 0.2]        80      0.16       0.18       0.02    0.02 * 0.08 = 0.0016
[0.2, 0.3]        70      0.25       0.30       0.05    0.05 * 0.07 = 0.0035
[0.3, 0.4]        60      0.36       0.42       0.06    0.06 * 0.06 = 0.0036
[0.4, 0.5]        70      0.46       0.51       0.05    0.05 * 0.07 = 0.0035
[0.5, 0.6]        80      0.55       0.59       0.04    0.04 * 0.08 = 0.0032
[0.6, 0.7]        90      0.65       0.62       0.03    0.03 * 0.09 = 0.0027
[0.7, 0.8]        100     0.75       0.71       0.04    0.04 * 0.10 = 0.0040
[0.8, 0.9]        150     0.86       0.78       0.08    0.08 * 0.15 = 0.0120
[0.9, 1.0]        250     0.97       0.83       0.14    0.14 * 0.25 = 0.0350

ECE = 0.001 + 0.0016 + 0.0035 + 0.0036 + 0.0035 + 0.0032 + 0.0027 + 0.0040 + 0.0120 + 0.0350
    = 0.0701
    ≈ 7.0%

Most of the miscalibration sits in the high-confidence bins: the last two contribute 0.047 of the 0.0701 total, and in both of them confidence exceeds accuracy. This kind of overconfidence is typical of modern deep networks.
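The arithmetic in the table can be reproduced with a short script. The tuples below are copied from the table rows; the variable names are chosen for this example only.

```python
# (count, average confidence, accuracy) for each bin, copied from the table.
bins = [
    (50, 0.07, 0.05),
    (80, 0.16, 0.18),
    (70, 0.25, 0.30),
    (60, 0.36, 0.42),
    (70, 0.46, 0.51),
    (80, 0.55, 0.59),
    (90, 0.65, 0.62),
    (100, 0.75, 0.71),
    (150, 0.86, 0.78),
    (250, 0.97, 0.83),
]

n = sum(count for count, _, _ in bins)

# Each bin contributes its share of the predictions times its absolute gap.
contributions = [count / n * abs(acc - conf) for count, conf, acc in bins]
ece = sum(contributions)

# Share of the total that comes from the two highest-confidence bins.
top_two_share = sum(contributions[-2:]) / ece

print(f"n = {n}, ECE = {ece:.4f}")                  # n = 1000, ECE = 0.0701
print(f"last two bins: {top_two_share:.0%} of ECE")  # last two bins: 67% of ECE
```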

What’s Clever

ECE captures the average gap weighted by where predictions actually fall. A bin no one falls into contributes nothing. A bin where 60% of predictions land contributes proportionally.

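The metric has a clear interpretation: it is the average absolute difference between the confidence the model states and the accuracy actually observed in the matching bin, weighted by where predictions fall. An ECE of 7% means that stated confidence and observed accuracy differ by about 7 percentage points on average.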

Limitations

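Bin choice matters. Too few bins (M=5) hide miscalibration, because over- and underconfident predictions can cancel within a wide bin; too many (M=50) leave bins too sparse for stable estimates. M=10 or 15 is standard.
Equal-width bins. Classifiers often have most predictions clumped near 0.99. With equal-width bins, nearly all of the mass lands in the top bin, where resolution is coarsest, while the remaining bins are sparse and noisy; equal-mass binning (each bin has the same number of predictions) is sometimes used instead.
Absolute gaps hide direction. A model that is 10 points overconfident in some bins and 10 points underconfident in others has ECE = 10%, the same as a model that is overconfident by 10 points everywhere, even though its mean confidence may match its overall accuracy. Reliability diagrams visualize the full pattern.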
Class-conditional ECE. A single ECE averages across all classes. For imbalanced or asymmetric tasks, per-class ECE reveals more.
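The equal-mass alternative mentioned above can be sketched as follows. This is one possible formulation, not a standard API: the name `equal_mass_ece` and the way bin boundaries are taken from the sorted order are assumptions of this example.

```python
def equal_mass_ece(confidences, correct, num_bins=10):
    """ECE with equal-mass bins: each bin holds (roughly) n / num_bins
    predictions instead of covering a fixed confidence range (sketch)."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally long, non-empty inputs")

    # Sort by confidence only, so ties keep their original order.
    pairs = sorted(zip(confidences, correct), key=lambda pair: pair[0])

    ece = 0.0
    for m in range(num_bins):
        # Integer boundaries split the sorted list into near-equal chunks.
        chunk = pairs[m * n // num_bins:(m + 1) * n // num_bins]
        if not chunk:
            continue  # happens only when there are more bins than predictions
        avg_conf = sum(p for p, _ in chunk) / len(chunk)
        accuracy = sum(int(hit) for _, hit in chunk) / len(chunk)
        ece += len(chunk) / n * abs(accuracy - avg_conf)
    return ece


# Made-up predictions clumped near the top of the range.
confidences = [0.55, 0.60, 0.91, 0.93, 0.97, 0.98, 0.99, 0.99]
correct = [1, 0, 1, 1, 1, 0, 1, 1]
ece = equal_mass_ece(confidences, correct, num_bins=4)
print(f"equal-mass ECE = {ece:.2f}")  # equal-mass ECE = 0.16
```

With four bins of two predictions each, the six predictions above 0.9 are spread over three bins rather than packed into a single equal-width bin, so the gap at 0.97-0.98 is not averaged together with its neighbours.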

Key Sources

on-calibration-of-modern-neural-networks — Guo et al. popularized ECE in the deep-learning literature.

Related Concepts

calibration — the parent topic
temperature-scaling — typical fix for high ECE
uncertainty-estimation — broader category

Open Questions

Adaptive binning. Equal-mass binning gives more stable estimates at high confidence, and binning-free alternatives in the style of the Kolmogorov-Smirnov (KS) test avoid choosing bins at all; neither is yet standard.
Multi-class metrics beyond top-1. ECE typically considers only the top predicted class. “Class-wise” or “full-distribution” calibration metrics generalize this but are less interpretable.
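For a sense of what a class-wise variant looks like, here is a sketch of one common formulation: for each class, bin the probability assigned to that class over all examples, compare it with how often that class is the true label, and average the resulting per-class scores. The function name `classwise_ece`, the unweighted average over classes, and the left-closed bins are assumptions of this example, not something the source specifies.

```python
def classwise_ece(probs, labels, num_bins=10):
    """Class-wise calibration error (illustrative sketch).

    probs: one row of class probabilities per example.
    labels: integer true label per example.
    """
    n = len(probs)
    if n == 0 or n != len(labels):
        raise ValueError("need equally long, non-empty inputs")
    num_classes = len(probs[0])

    per_class = []
    for k in range(num_classes):
        prob_sum = [0.0] * num_bins  # sum of p_k per bin
        positives = [0] * num_bins   # examples in the bin whose label is k
        count = [0] * num_bins
        for row, y in zip(probs, labels):
            # Assumed convention: left-closed bins, 1.0 joins the last bin.
            m = min(int(row[k] * num_bins), num_bins - 1)
            prob_sum[m] += row[k]
            positives[m] += int(y == k)
            count[m] += 1
        per_class.append(sum(
            count[m] / n * abs(positives[m] / count[m] - prob_sum[m] / count[m])
            for m in range(num_bins) if count[m]
        ))
    # Unweighted mean over classes (an assumption; weighting schemes vary).
    return sum(per_class) / num_classes


# Four made-up binary predictions.
probs = [[0.8, 0.2], [0.8, 0.2], [0.3, 0.7], [0.3, 0.7]]
labels = [0, 1, 1, 1]
score = classwise_ece(probs, labels)
print(f"class-wise ECE = {score:.2f}")  # class-wise ECE = 0.30
```

Unlike top-1 ECE, every entry of the probability vector is checked, including low probabilities assigned to classes the model did not predict.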
