The Problem
You have a binary classifier producing scores s_i in [0, 1] (or any monotone-rankable values), and binary labels y_i in {0, 1}. The classifier ranks correctly (higher scores → more likely to be positive) but the score values themselves are not calibrated probabilities. You need a function f(s) → p that maps scores to probabilities, that respects the monotone relationship and minimizes some calibration loss.
A linear fit (Platt scaling) is too restrictive when the score-to-probability map is non-linear. A free-form fit (any function) overfits and may not be monotone, breaking the rank order. You want a non-parametric, monotone fit.
The Key Insight
Isotonic regression finds the best step function f(s) that is monotonically non-decreasing. It minimizes squared error to the labels subject to f being non-decreasing in s. The solution can be computed in O(n log n) via the pool-adjacent-violators algorithm (PAV). The result is a piecewise-constant function: it climbs in discrete steps as s grows.
The non-parametric form lets it fit arbitrary monotone shapes (sigmoid-like, log-like, sharp threshold-like) without assuming a parametric family.
Mechanism in Plain English
- Sort training pairs (s_i, y_i) by score s_i.
- Initialize: each pair is its own pool, with mean = y_i.
- PAV pass: for each adjacent pair of pools (pool_i, pool_{i+1}) where mean(pool_i) > mean(pool_{i+1}) (a violation of monotonicity), merge them. The merged pool’s mean is the weighted average of constituent means.
- Repeat until no violations remain.
- The fitted function: for an input score s, find which pool s falls into; output the pool’s mean.
The PAV algorithm runs in O(n log n) for sorting plus O(n) for the merge pass.
Math with Translation
Minimize:
subject to: f(s_1) ⇐ f(s_2) ⇐ … ⇐ f(s_n) when s_1 ⇐ s_2 ⇐ … ⇐ s_n.
- w_i — sample weight (often 1)
- f — the monotone function being fit
- The constraint is: never let f decrease as s increases.
The Lagrangian solution is the PAV: greedily enforce monotonicity by pooling violators.
Concrete Walkthrough
Five samples sorted by score:
i score s_i label y_i
0 0.10 0
1 0.20 0
2 0.30 1 <- mean(2) = 1.0; mean(1) = 0.0 — fine, 1.0 > 0.0
3 0.40 0 <- mean(3) = 0.0 < mean(2) = 1.0 — violation!
4 0.50 1
Step 1: Pool indices 2 and 3 (0.30 and 0.40 scores).
Combined: mean = (1+0) / 2 = 0.5.
Pool now spans s in [0.30, 0.40].
Step 2: Check mean(pool 0..1) = 0; mean(merged 2-3) = 0.5; mean(4) = 1.0. Monotonic.
Final fitted f(s):
s in [0.10, 0.20]: f(s) = 0.0
s in [0.30, 0.40]: f(s) = 0.5
s = 0.50: f(s) = 1.0
For a new score s = 0.35: f(0.35) = 0.5. For s = 0.45: f(0.45) = 0.5. For s = 0.55 (above max): clamp to f(0.50) = 1.0.
What’s Clever
Isotonic regression buys you flexibility (any monotone shape) without giving up the rank-preserving property of the original classifier. It is also easy to evaluate (binary search to find the pool) and easy to interpret (look at the fitted step function and see where the model’s confidence diverges from reality).
The trade-off vs temperature scaling: isotonic has more parameters (one per pool, often dozens), so it can overfit on small validation sets. Temperature scaling has 1 parameter and generalizes better. Use isotonic when you have enough validation data and the monotone-but-non-linear shape matters.
Code
from sklearn.isotonic import IsotonicRegression
# Train
ir = IsotonicRegression(out_of_bounds='clip')
ir.fit(scores_val, labels_val)
# Predict calibrated probabilities
scores_test = model.predict_proba(X_test)[:, 1]
probs_calibrated = ir.predict(scores_test)Key Sources
- on-calibration-of-modern-neural-networks — Guo et al. evaluate isotonic among other calibrators
Related Concepts
- calibration — the parent topic
- temperature-scaling — the parametric alternative; better generalization, less flexibility
- uncertainty-estimation — calibration is the foundation of usable uncertainty
Open Questions
- Multi-class isotonic. Native isotonic is for binary. Multi-class extensions (per-class one-vs-rest, or matrix isotonic via cyclic projection) exist but are less standard.
- Online isotonic. PAV is offline. Streaming variants exist for online recalibration but trade some accuracy.