Josef Chen Shows the Wrong Metric Caps Multi-Model LLM Gains
TL;DR
- Any routing, voting, or cascade policy is bounded by 1-β, where β is the rate at which all pool models fail simultaneously; pairwise error correlation does not set that bound.
- Pairwise error correlation ρ, the field's standard ensemble diagnostic, is mathematically blind to the co-failure rate β and cannot predict gains.
- Across 67 frontier models, a correctly calibrated single-factor model underprices the all-wrong tail by about 2.5× on open-ended mathematics.
Building model ensembles has become standard practice: route queries across multiple LLMs, vote on results, cascade from cheap to strong, and the combined system is assumed to beat any single model. A paper from Josef Chen of KAIKAKU argues the field has been optimizing for the wrong quantity, and provides a theorem and a free measurement to demonstrate it.
The ceiling result is straightforward. For any selection policy whose output is one of the member models' answers -- a router, a majority vote, a cascade -- accuracy cannot exceed 1-β, where β is the rate at which every model in the pool is wrong on the same query: when every member is wrong, whichever answer the policy selects is wrong too. The field instead reports ρ, the average pairwise error correlation between models. Chen proves that ρ cannot identify β: error distributions with identical marginals and identical pairwise correlations can still differ in their co-failure rate. A Clopper-Pearson bound on β, computed from one graded query set at no additional inference cost, gives a ceiling certificate on the maximum gain any such policy can deliver, available before a router is trained.
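The non-identifiability claim can be illustrated with a toy pool of three models. The sketch below is not the paper's proof or its data; it is a textbook-style construction chosen for this summary, and the names `stats`, `independent`, `xor_pool`, and `xnor_pool` are hypothetical. Each pool has the same per-model error rate (0.5) and the same pairwise error correlation (0), yet the all-wrong rate β differs across them.

```python
from itertools import product


def stats(dist):
    """Summarize a joint error distribution over three models.

    dist maps (e1, e2, e3) error indicators (1 = model is wrong)
    to the probability of that pattern.
    Returns (marginal error rates, pairwise correlations, beta).
    """
    n = 3
    marg = [sum(p * e[i] for e, p in dist.items()) for i in range(n)]
    corr = {}
    for i in range(n):
        for j in range(i + 1, n):
            joint = sum(p * e[i] * e[j] for e, p in dist.items())
            cov = joint - marg[i] * marg[j]
            var_i = marg[i] * (1 - marg[i])
            var_j = marg[j] * (1 - marg[j])
            corr[(i, j)] = cov / (var_i * var_j) ** 0.5
    beta = dist.get((1, 1, 1), 0.0)  # probability that every model is wrong
    return marg, corr, beta


# Toy pool A: three fully independent models, each wrong half the time.
independent = {e: 1 / 8 for e in product((0, 1), repeat=3)}

# Toy pool B: the third model errs exactly when one of the first two errs.
xor_pool = {(a, b, a ^ b): 1 / 4 for a, b in product((0, 1), repeat=2)}

# Toy pool C: the third model errs when the first two agree.
xnor_pool = {(a, b, 1 - (a ^ b)): 1 / 4 for a, b in product((0, 1), repeat=2)}

for name, pool in [("independent", independent),
                   ("xor", xor_pool),
                   ("xnor", xnor_pool)]:
    marg, corr, beta = stats(pool)
    print(name, "marginals:", marg, "pairwise rho:", corr, "beta:", beta)
```

In this construction the ceiling 1-β is 0.875, 1.0, and 0.75 respectively, even though no marginal or pairwise statistic distinguishes the three pools.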
To measure how far the gap runs in practice, Chen tested 67 frontier models from 21 providers, including GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, Grok-4.3, DeepSeek V4, Qwen3.7-Max, and Kimi K2.7, against hard open-ended benchmarks. On MATH-500, over 330 fully-covered queries, the all-wrong rate was β=0.052 -- resting on only 17 all-wrong events. A correctly calibrated single-factor model predicted only 0.021, about 2.5× below the observed tail (bootstrap 90% CI 1.7 to 3.4×). The same signature appears on execution-graded competitive programming, where β=0.079. The underpricing grows monotonically with pool size, and the paper traces the residual to a common-mode pattern -- problems where the entire frontier simultaneously fails -- that no pairwise statistic can represent.
The caveats are substantial. The all-wrong event counts are small (k=17 on MATH-500, k=5 on code), so the confidence intervals are wide and the magnitude of the underpricing carries real uncertainty. The co-failure tail largely vanishes on multiple-choice tasks, suggesting the effect tracks open-ended task format rather than subject matter. The paper claims no new routing algorithm -- only a measurement and a certificate.
What the work offers practitioners is a zero-cost pre-deployment instrument: compute β from a held-out graded set, bound it with Clopper-Pearson, and check whether the implied headroom (the ceiling 1-β minus the single best model's accuracy) exceeds the orchestration overhead. According to the paper, on hard open-ended tasks, combining models rarely beats the single best model without a strong query-level routing signal; the gains come from models that fail on different questions, not from adding more models.
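The recipe above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the paper's code: the function names are hypothetical, the interval is a two-sided Clopper-Pearson interval found by bisection on the binomial CDF, and the `best_single_acc=0.90` value in the usage line is a placeholder, since this article does not report the best single model's accuracy. Only the counts (17 all-wrong events out of 330 queries) come from the text.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p). Fine for small k; may overflow for huge n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target, iters=100):
    """Bisection on [0, 1] for a function f that decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.10):
    """Two-sided exact (Clopper-Pearson) interval for a binomial proportion."""
    # Lower bound solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def ceiling_certificate(all_wrong, n_queries, best_single_acc, alpha=0.10):
    """Bound the accuracy ceiling 1 - beta and the headroom over the best model."""
    beta_hat = all_wrong / n_queries
    beta_lo, beta_hi = clopper_pearson(all_wrong, n_queries, alpha)
    return {
        "beta_hat": beta_hat,
        "beta_ci": (beta_lo, beta_hi),
        "ceiling_point": 1 - beta_hat,
        "ceiling_ci": (1 - beta_hi, 1 - beta_lo),
        # Point estimate of the most any selection policy could add.
        "max_gain_point": 1 - beta_hat - best_single_acc,
    }


# Counts from the article; best_single_acc is a HYPOTHETICAL placeholder.
report = ceiling_certificate(17, 330, best_single_acc=0.90)
print(report)
```

If the headroom in `max_gain_point` is smaller than the latency and cost overhead of orchestration, the pool is unlikely to repay a router. With few all-wrong events, the interval in `ceiling_ci` should be read alongside the point estimate.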
Originally reported by huggingface.co
Original headline: Co-Failure Ceiling: New Paper Proves Multi-Model LLM Ensembles Are Bounded by All-Wrong Rate Across 67 Frontier Models — Invalidates Field's Standard Metric