Isotonic Regression

The Problem

You have a binary classifier producing scores s_i in [0, 1] (or any monotone-rankable values), and binary labels y_i in {0, 1}. The classifier ranks correctly (higher scores → more likely to be positive) but the score values themselves are not calibrated probabilities. You need a function f(s) → p that maps scores to probabilities, that respects the monotone relationship and minimizes some calibration loss.

A linear fit (Platt scaling) is too restrictive when the score-to-probability map is non-linear. A free-form fit (any function) overfits and may not be monotone, breaking the rank order. You want a non-parametric, monotone fit.

The Key Insight

Isotonic regression finds the best step function f(s) that is monotonically non-decreasing. It minimizes squared error to the labels subject to f being non-decreasing in s. The solution can be computed in O(n log n) via the pool-adjacent-violators algorithm (PAV). The result is a piecewise-constant function: it climbs in discrete steps as s grows.

The non-parametric form lets it fit arbitrary monotone shapes (sigmoid-like, log-like, sharp threshold-like) without assuming a parametric family.

Mechanism in Plain English

Sort training pairs (s_i, y_i) by score s_i.
Initialize: each pair is its own pool, with mean = y_i.
PAV pass: for each adjacent pair of pools (pool_i, pool_{i+1}) where mean(pool_i) > mean(pool_{i+1}) (a violation of monotonicity), merge them. The merged pool’s mean is the weighted average of constituent means.
Repeat until no violations remain.
The fitted function: for an input score s, find which pool s falls into; output the pool’s mean.

The PAV algorithm runs in O(n log n) for sorting plus O(n) for the merge pass.

Math with Translation

Minimize:

$\sum_{i} w_{i} (f (s_{i}) - y_{i})^{2}$

subject to: f(s_1) ⇐ f(s_2) ⇐ … ⇐ f(s_n) when s_1 ⇐ s_2 ⇐ … ⇐ s_n.

w_i — sample weight (often 1)
f — the monotone function being fit
The constraint is: never let f decrease as s increases.

The Lagrangian solution is the PAV: greedily enforce monotonicity by pooling violators.

Concrete Walkthrough

Five samples sorted by score:

i  score s_i  label y_i
0  0.10       0
1  0.20       0
2  0.30       1     <- mean(2) = 1.0; mean(1) = 0.0 — fine, 1.0 > 0.0
3  0.40       0     <- mean(3) = 0.0 < mean(2) = 1.0 — violation!
4  0.50       1

Step 1: Pool indices 2 and 3 (0.30 and 0.40 scores).
  Combined: mean = (1+0) / 2 = 0.5.
  Pool now spans s in [0.30, 0.40].

Step 2: Check mean(pool 0..1) = 0; mean(merged 2-3) = 0.5; mean(4) = 1.0. Monotonic.

Final fitted f(s):
  s in [0.10, 0.20]: f(s) = 0.0
  s in [0.30, 0.40]: f(s) = 0.5
  s = 0.50:           f(s) = 1.0

For a new score s = 0.35: f(0.35) = 0.5. For s = 0.45: f(0.45) = 0.5. For s = 0.55 (above max): clamp to f(0.50) = 1.0.

What’s Clever

Isotonic regression buys you flexibility (any monotone shape) without giving up the rank-preserving property of the original classifier. It is also easy to evaluate (binary search to find the pool) and easy to interpret (look at the fitted step function and see where the model’s confidence diverges from reality).

The trade-off vs temperature scaling: isotonic has more parameters (one per pool, often dozens), so it can overfit on small validation sets. Temperature scaling has 1 parameter and generalizes better. Use isotonic when you have enough validation data and the monotone-but-non-linear shape matters.

Code

from sklearn.isotonic import IsotonicRegression
 
# Train
ir = IsotonicRegression(out_of_bounds='clip')
ir.fit(scores_val, labels_val)
 
# Predict calibrated probabilities
scores_test = model.predict_proba(X_test)[:, 1]
probs_calibrated = ir.predict(scores_test)

Key Sources

on-calibration-of-modern-neural-networks — Guo et al. evaluate isotonic among other calibrators

calibration — the parent topic
temperature-scaling — the parametric alternative; better generalization, less flexibility
uncertainty-estimation — calibration is the foundation of usable uncertainty

Open Questions

Multi-class isotonic. Native isotonic is for binary. Multi-class extensions (per-class one-vs-rest, or matrix isotonic via cyclic projection) exist but are less standard.
Online isotonic. PAV is offline. Streaming variants exist for online recalibration but trade some accuracy.

ML Wiki

Explorer

Isotonic Regression

The Problem

The Key Insight

Mechanism in Plain English

Math with Translation

Concrete Walkthrough

What’s Clever

Code

Key Sources

Open Questions

Graph View

Table of Contents

Backlinks

ML Wiki

Explorer

Isotonic Regression

The Problem

The Key Insight

Mechanism in Plain English

Math with Translation

Concrete Walkthrough

What’s Clever

Code

Key Sources

Related Concepts

Open Questions

Graph View

Table of Contents

Backlinks