Concepts: ensemble-methods | uncertainty-estimation | self-consistency | vision-language-models
Builds on: self-consistency-chain-of-thought-reasoning
Leads to: (future work on multi-model agreement in VLMs)
You’ve built an OCR pipeline using the best vision-language models available — Qwen2.5-VL-72B, GPT-4o. The models score well on benchmarks. Then you deploy to production and discover: the models routinely make quiet, confident mistakes. A digit transposed. A formula character missed. And you have no way to know which outputs to trust. The benchmark score tells you the average; it tells you nothing about whether this particular output is correct.
This paper offers a simple answer to that question: ask multiple models and measure how much they agree.
The core idea
The analogy: Imagine asking ten witnesses to describe what they saw at an accident. If nine of them independently give the same account — the car was red, it ran the light, the driver was wearing a blue hat — you have high confidence that account is accurate. If the witnesses give wildly different stories, something is wrong: either the event was ambiguous, or some witnesses didn’t actually see clearly. You don’t need a judge to tell you which account is right. The agreement pattern tells you.
Consensus Entropy (CE) works the same way for OCR. Run multiple VLMs on the same image. If they all produce nearly identical text, the output is probably correct. If they diverge, something in the image is hard — a smudged character, unusual font, complex layout — and the output needs more attention. No labels required. No training needed. The signal comes entirely from the pattern of disagreement.
The key observation the authors make after studying 210 VLMs on OCRBench: “correct outputs naturally converge in a shared representation space while erroneous outputs diverge.” This is the insight everything else builds on.
The mechanism, step by step:
- Send the same image to N independent VLMs (e.g., GPT-4o, Qwen2.5-VL-7B, InternVL2.5-8B).
- Collect N text outputs.
- Measure pairwise similarity between every pair of outputs. For character-precise tasks (OCR, math), use edit distance. For semantic tasks (VQA), use cosine similarity between text embeddings.
- For each model’s output, compute its average distance to all other outputs: how isolated is this prediction?
- Aggregate these distances into a single entropy score using kernel density estimation or direct averaging.
- Compare the entropy score against a threshold:
  - Low entropy: models agree. Trust the weighted ensemble.
  - High entropy: models disagree. Route to a stronger model.
                            IMAGE
                              |
            +-----------+-----+-----+-----------+
            |           |           |           |
          VLM1        VLM2        VLM3        VLM4
        "The cat"   "The cat"   "Tbe cat"   "The cat"
            |           |           |           |
            +-----------+-----+-----+-----------+
                              |
                     Pairwise Similarity
                       (Edit Distance)
                              |
            VLM1 avg dist: 0.1 (close to consensus)
            VLM2 avg dist: 0.1 (close to consensus)
            VLM3 avg dist: 0.8 (outlier, likely wrong)
            VLM4 avg dist: 0.1 (close to consensus)
                              |
                     Consensus Entropy δ
                              |
            δ low?  --> Weighted ensemble (higher weight to VLM1, VLM2, VLM4)
            δ high? --> Route to stronger VLM (GPT-4o)
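The edit-distance branch of the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the distance is normalized by the longer string's length (an assumption), and the final score uses the direct-averaging option rather than KDE. Exact values depend on the normalization chosen, so they will not match the schematic numbers in the diagram.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (ca != cb)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def isolation_scores(outputs):
    """For each output, its average distance to all the other outputs."""
    n = len(outputs)
    return [
        sum(normalized_distance(outputs[i], outputs[j]) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


def consensus_score(outputs):
    """Mean pairwise distance across all outputs (direct averaging, no KDE)."""
    pairs = list(combinations(outputs, 2))
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


outputs = ["The cat", "The cat", "Tbe cat", "The cat"]
scores = isolation_scores(outputs)
outlier = max(range(len(outputs)), key=scores.__getitem__)
print([round(s, 3) for s in scores])   # the third model is the most isolated
print("outlier index:", outlier)
print("consensus score:", round(consensus_score(outputs), 3))
```

The outlier falls out of the isolation scores alone; no reference text is consulted anywhere.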
The math, translated:
Step one is pairwise entropy. For each pair of model outputs $(y_i, y_j)$, compute a pairwise entropy $E_{ij}$ from $p_{ij}$, the probability distribution derived from the position-by-position similarity between the two outputs. Translation: for each character position, how different are the two outputs? High entropy at position $k$ means the models put very different characters there.
Step two: each model’s average isolation from the group:
$$\bar{E}_i = \frac{1}{N-1} \sum_{j \neq i} E_{ij}$$
Translation: model $i$’s average “weirdness” relative to all other models. A small $\bar{E}_i$ means this model agrees with the pack. A large one means it’s an outlier.
Step three: ensemble weights favor the agreers:
$$w_i = \frac{1/\bar{E}_i}{\sum_{j=1}^{N} 1/\bar{E}_j}$$
Translation: models that agree with the consensus get high weight. Outliers get downweighted. You’re building a democratic vote where the representatives with the most corroborated testimony speak loudest.
The final Consensus Entropy is computed over the full output distribution using KDE in the semantic case or mean pairwise distance in the edit-distance case.
Walkthrough with actual numbers:
Four models OCR the same PDF line. The ground truth is “Return on equity: $4.2M”.
Model outputs:
M1: "Return on equity: $4.2M" (correct)
M2: "Return on equity: $4.2M" (correct)
M3: "Return on equity: $42M" (wrong: missing decimal)
M4: "Return on equity: $4.2M" (correct)
Pairwise edit distances (normalized to [0,1]):
E12 = 0.0 (identical)
E13 = 0.08 (one char diff: "4.2" vs "42")
E14 = 0.0 (identical)
E23 = 0.08
E24 = 0.0
E34 = 0.08
Average entropy distances:
E_bar_1 = (0.0 + 0.08 + 0.0) / 3 = 0.027
E_bar_2 = (0.0 + 0.08 + 0.0) / 3 = 0.027
E_bar_3 = (0.08 + 0.08 + 0.08) / 3 = 0.080 <-- outlier
E_bar_4 = (0.0 + 0.0 + 0.08) / 3 = 0.027
Ensemble weights (proportional to 1/E_bar, using unrounded E_bar values; sum = 3 × 37.5 + 12.5 = 125):
w1 = (1/0.0267) / sum = 37.5 / 125 = 0.300
w2 = 0.300
w3 = (1/0.080) / sum = 12.5 / 125 = 0.100 <-- downweighted
w4 = 0.300
Consensus selection picks "Return on equity: $4.2M"
because each model backing it carries 3× the weight of M3's wrong output.
Overall Consensus Entropy in this case is low (the group largely agrees), so the ensemble result is accepted directly. The “$42M” output is suppressed without any labels or judges.
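To check the walkthrough arithmetic, here is a small sketch that takes the pairwise distances above as given and derives the average distances and inverse-distance weights. The helper names are hypothetical. The `eps` guard against division by zero (when every model agrees exactly) and the final step of summing weights over identical strings are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def ensemble_weights(dist, eps=1e-9):
    """Average distance per model, then weights proportional to 1 / average."""
    n = len(dist)
    e_bar = [sum(dist[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    inverse = [1.0 / (e + eps) for e in e_bar]  # eps avoids 1/0 on full agreement
    total = sum(inverse)
    return e_bar, [v / total for v in inverse]


def weighted_vote(outputs, weights):
    """Sum the weights of identical strings and return the heaviest one (assumed rule)."""
    totals = defaultdict(float)
    for text, weight in zip(outputs, weights):
        totals[text] += weight
    return max(totals, key=totals.get), dict(totals)


outputs = [
    "Return on equity: $4.2M",
    "Return on equity: $4.2M",
    "Return on equity: $42M",
    "Return on equity: $4.2M",
]
# Pairwise distances copied from the walkthrough (E13 = E23 = E34 = 0.08).
dist = [
    [0.00, 0.00, 0.08, 0.00],
    [0.00, 0.00, 0.08, 0.00],
    [0.08, 0.08, 0.00, 0.08],
    [0.00, 0.00, 0.08, 0.00],
]

e_bar, weights = ensemble_weights(dist)
winner, totals = weighted_vote(outputs, weights)
print([round(e, 3) for e in e_bar])
print([round(w, 3) for w in weights])
print(winner)
```

Because the weights are normalized, they sum to 1: each agreeing model receives three times the weight of M3, and the three matching strings together dominate the vote.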
What’s clever:
The deepest insight is that “correct predictions converge while errors diverge” is not an assumption — it’s a structural property of OCR tasks. OCR has (approximately) a unique ground truth. There are many distinct wrong ways to misread a character, but only one right answer. So when models make independent errors, they make different errors. The agreement is correlated with correctness in a way that wouldn’t hold for, say, creative writing tasks where multiple “correct” answers exist.
This is why self-consistency (same model, multiple samples) works for reasoning but is less powerful than cross-model consensus: a single model’s errors are correlated with that model’s specific failure modes. Ask GPT-4o three times, and all three might make the same mistake. Ask three different VLMs trained by different teams on different data, and their errors are largely independent.
“CE requires no training or supervision, enabling plug-and-play integration.”
Translation: you can add this to any existing OCR stack without touching your models, your training data, or your labels.
“CE leverages character-level consensus directly from the outputs of any model (open-source or proprietary), without needing access to interior parameters”
Translation: it works on black-box API models. You only need the output text, not logits or probabilities.
Does it actually work? What breaks?
Verification (can you detect bad OCR without labels?):
| Method | Overall F1 (Qwen2-VL-72B) | vs. Best VLM-as-Judge |
|---|---|---|
| VLM-as-Judge | 39.8 | baseline |
| Consensus Entropy (ours) | 51.0 | +28.1% |
The gap is largest in the hard cases: for outputs the human annotators rated 0.0–0.3 quality, CE achieves F1=69.57 vs. VLM-as-Judge’s 52.83. This is exactly where you want a quality filter to work — catching the worst outputs reliably.
OCR improvement (CE-Ensemble + routing):
| Method | OCRBench Score | vs. Best Single Model |
|---|---|---|
| Best single VLM | 888 (Qwen2VL-72B) | baseline |
| Self-Consistency@3 (best) | 875 | -1.5% |
| CE-OCR Routing (best) | 922 | +8.2% |
| CE-Ensemble (open-source only, <10B params) | 933 | +5.1% over SOTA closed-source |
The open-source result is striking: three sub-10B models (Ovis2-1B, Ovis2-4B, Qwen2VL-7B) combined via CE-Ensemble score 933, beating the SOTA single-model score of 926 — at a fraction of the cost (1 GPU vs. 4×80GB).
What doesn’t work:
CE assumes the models’ errors are independent. If you query the same model family repeatedly (e.g., all Qwen variants), errors are correlated, and the advantage diminishes. The paper shows same-family ensembles still gain (+1.06% to +1.83%), but cross-family ensembles gain more.
CE also degrades on tasks without a unique ground truth. The paper’s ROVER comparison shows this clearly: ROVER fails catastrophically on Doc-VQA (-71.8%) because discrete voting breaks when correct answers vary in phrasing. CE, using continuous similarity, handles this better, but it’s still weaker on open-ended generation than on factual OCR.
Threshold calibration matters. One threshold setting works across most OCR tasks, but different tasks may need different settings: in the paper’s Table 7, a setting that routes 91.5% of samples gives 3.8× more accuracy improvement than one that routes only 11.5%.
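One way to see the threshold trade-off on your own data is to sweep candidate thresholds over disagreement scores from an unlabeled sample and watch how the routed share changes. The sketch below assumes you already have one score per sample. The scores, thresholds, and function names are all hypothetical, and this is not the paper's calibration procedure.

```python
def routed_fraction(scores, threshold):
    """Share of samples whose disagreement score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def threshold_for_budget(scores, budget):
    """Smallest threshold that routes at most `budget` (a fraction) of samples."""
    ranked = sorted(scores, reverse=True)
    k = int(budget * len(ranked))  # number of samples we can afford to route
    if k >= len(ranked):
        return float("-inf")       # the budget covers every sample
    return ranked[k]               # only scores strictly above this are routed


# Hypothetical per-sample scores from an unlabeled calibration set.
scores = [0.00, 0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35, 0.60]

for t in (0.02, 0.10, 0.30):
    print(f"threshold={t:.2f} routes {routed_fraction(scores, t):.0%} of samples")

t20 = threshold_for_budget(scores, 0.20)
print("threshold for a 20% routing budget:", t20)
```

Fixing a routing budget first and reading the threshold off the score distribution needs no labels, which fits the label-free setting described here.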
Finally, CE adds inference cost proportional to the number of models. A 3-model ensemble costs roughly 3× a single model before routing savings kick in, and even when the models are called in parallel, latency is set by the slowest one. For very high-volume pipelines, this matters.
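A rough cost model makes the trade-off concrete. The sketch assumes every ensemble member runs on every image and the stronger model runs only on routed samples. The per-call costs are hypothetical units, and the 7.3% routing share is the figure the paper reports.

```python
def expected_cost(ensemble_costs, strong_cost, routed_fraction):
    """Average per-image cost: all ensemble members, plus the strong model on routed samples."""
    return sum(ensemble_costs) + routed_fraction * strong_cost


def break_even_fraction(ensemble_costs, strong_cost):
    """Routed share at which the pipeline costs the same as always calling the strong model."""
    return 1.0 - sum(ensemble_costs) / strong_cost


small_models = [1.0, 1.0, 1.0]  # hypothetical cost units per call
strong_model = 10.0             # hypothetical cost of the stronger model

cost = expected_cost(small_models, strong_model, routed_fraction=0.073)
limit = break_even_fraction(small_models, strong_model)
print(f"ensemble + routing: {cost:.2f} units per image (always-strong: {strong_model:.2f})")
print(f"break-even routed share: {limit:.0%}")
```

Under these made-up costs the ensemble wins as long as fewer than 70% of samples are routed. With three models that each cost as much as the strong one, it could never break even.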
So what?
If you’re building ML systems that process documents, invoices, scientific PDFs, or any high-stakes text extraction pipeline, CE is low-hanging fruit. You likely already have access to multiple VLMs via API. Add CE as a post-processing layer: compute edit-distance entropy across model outputs, flag high-entropy results for human review or rerouting, and use weighted ensemble for the rest. No retraining. No labeled data. The paper shows you can route only 7.3% of samples and still capture the gains.
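A post-processing layer of this kind might look like the sketch below. It is an illustration under stated assumptions, not the paper's pipeline. The model callables and `ocr_with_routing` are hypothetical stand-ins for real API clients. `difflib.SequenceMatcher` from the standard library substitutes for a normalized edit distance. In the low-disagreement branch the code returns the output closest to all the others, a simplification of the weighted ensemble.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Sequence, Tuple

OcrModel = Callable[[bytes], str]  # hypothetical signature: image bytes -> text


def distance(a: str, b: str) -> float:
    """1 - SequenceMatcher ratio, used here in place of normalized edit distance."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def disagreement(outputs: Sequence[str]) -> float:
    """Mean pairwise distance between all model outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def ocr_with_routing(
    image: bytes, models: Sequence[OcrModel], fallback: OcrModel, threshold: float
) -> Tuple[str, float, bool]:
    """Return (text, disagreement score, whether the sample was routed)."""
    outputs = [model(image) for model in models]
    score = disagreement(outputs)
    if score > threshold:
        return fallback(image), score, True  # models disagree: escalate
    # Models agree: keep the output closest to all the others.
    best = min(outputs, key=lambda o: sum(distance(o, other) for other in outputs))
    return best, score, False


# Stub models standing in for real VLM calls.
agreeing = [lambda img: "Total: 42", lambda img: "Total: 42", lambda img: "Total: 4Z"]
conflicting = [lambda img: "Total: 42", lambda img: "Tota1: 4Z", lambda img: "Tofal; A2"]
strong = lambda img: "Total: 42 (strong model)"

easy = ocr_with_routing(b"", agreeing, strong, threshold=0.1)
hard = ocr_with_routing(b"", conflicting, strong, threshold=0.1)
print(easy)
print(hard)
```

The same `routed` flag could send a sample to a human review queue instead of a stronger model. The layer only needs each model's output text.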
The broader connection: this paper extends the self-consistency-chain-of-thought-reasoning idea from single-model sampling to cross-model consensus, and from reasoning to perception. Self-consistency showed that majority voting on reasoning paths improves over single-sample greedy decoding. CE shows that entropy-weighted agreement across different models is a stronger signal than majority voting within one model — because the errors are more independent. Think of it as the ensemble-methods insight applied to quality verification rather than just prediction averaging. The ensemble’s value isn’t just the average — it’s that disagreement is informative.
When multiple independent experts agree, trust the answer. When they disagree, route to an expert or flag for review. This is how human review pipelines work. CE automates that instinct.
Paper: arXiv:2504.11101 — Zhang et al. — 2025
Connections
- ensemble-methods — CE-Ensemble is a confidence-weighted ensemble where weights come from inter-model agreement
- uncertainty-estimation — CE quantifies output uncertainty without labels or model internals
- self-consistency — CE extends self-consistency from single-model sampling to cross-model consensus
- vision-language-models — CE is designed for VLM outputs and evaluated across 210 VLMs
- self-consistency-chain-of-thought-reasoning — the single-model sampling baseline CE improves upon
Citation
Zhang, Y., Liang, T., Huang, X., Cui, E., Wang, G., Guo, X., Li, C., & Liu, G. (2025). Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR. arXiv preprint. https://arxiv.org/abs/2504.11101