arxiv.org via Reddit

ISR Gate Drops Evidence-QA Hallucination to 0.7%

hallucinations rag

Key insights

  • Permutation-induced dispersion in evidence QA scales O(log n) with chunk count; Qwen2-7B slope was 0.377 with R² = 0.742.
  • The ISR abstention gate achieved 0.0-0.7% hallucination on a 528-item Gemma-2-9B held-out audit at 20.6-27.9% abstention cost.
  • Increasing evidence dose from 0 to 3 reduced hallucination by 17.6 percentage points with Spearman ρ = -0.80 (p<0.001).
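The O(log n) claim in the first insight amounts to a straight-line fit of dispersion against the logarithm of chunk count, summarized by a slope and an R². A minimal sketch of such a fit follows. The function name `fit_log_law`, the use of the natural logarithm, and the toy data are assumptions made for illustration; the summary does not state which log base or fitting procedure the authors used, and the toy values are generated from an exact log law rather than taken from the paper.

```python
import math

def fit_log_law(chunk_counts, dispersions):
    """Ordinary least-squares fit of dispersion = slope * ln(n) + intercept.

    Returns (slope, intercept, r2). Natural log is an assumption here.
    """
    xs = [math.log(n) for n in chunk_counts]
    ys = list(dispersions)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2

# Toy data built from an exact log law (hypothetical, not the paper's measurements).
toy_counts = [2, 4, 8, 16, 32]
toy_dispersion = [0.5 * math.log(n) + 0.1 for n in toy_counts]
slope, intercept, r2 = fit_log_law(toy_counts, toy_dispersion)
```

On real measurements the fit would be noisier. An R² of 0.742, as reported for Qwen2-7B, means the log term explains about three quarters of the variance in dispersion.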

Why this matters

Evidence-grounded QA underpins legal discovery tools, regulatory compliance pipelines, and clinical decision support. This paper demonstrates that hallucination rates in these systems are predictably tied to information budget. The ISR gate result (0.0-0.7% hallucination at 24.1% abstention on Gemma-2-9B) gives practitioners a concrete, deployable reliability target. The O(log n) dispersion law reframes order-sensitivity from an unpredictable quirk into a signal that can be measured and engineered around, making hallucination risk quantifiable before deployment.

Summary

LLMs used for evidence-grounded adjudication produce measurably different answers depending on document order, and Chlon et al. (Leon Chlon, Ahmed Karim, Maggie Chlon, MarcAntonio Awada) show that this variance follows O(log n) scaling rather than behaving as random noise. On Qwen2-7B, the permutation dispersion slope reached 0.377 (R² = 0.742). Increasing evidence dose from 0 to 3 dropped hallucination by 17.6 percentage points, with Spearman ρ = -0.80 linking information budget directly to hallucination rate. In short, order-sensitivity in evidence QA is measurable and gateable:

  • The ISR abstention gate on 528 held-out items achieved 0.0-0.7% hallucination at 20.6-27.9% abstention.
  • Full evaluation covered 3,059 items across FEVER, HotpotQA, NQ-Open, and PopQA.

Hallucination in evidence-grounded QA is not a black-box failure mode; it is a measurable information budget problem.
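The tradeoff between abstention and hallucination reported for the gate can be illustrated with a simple threshold rule. This is a sketch under stated assumptions, not the authors' implementation: the summary does not say how the ISR score is computed, so the per-item `score` is a hypothetical input, `gate_report` is an illustrative name, the toy items are invented, and the hallucination rate is assumed to be measured over answered items only.

```python
def gate_report(items, threshold):
    """Apply a score-threshold abstention gate to labeled items.

    items: list of (score, is_hallucination) pairs, where score is a
           hypothetical per-item sufficiency score.
    Returns (abstention_rate, hallucination_rate_among_answered).
    """
    answered = [(s, h) for s, h in items if s >= threshold]
    abstention_rate = (len(items) - len(answered)) / len(items)
    hallucinated = sum(1 for _, h in answered if h)
    # Assumption: the rate is computed over answered items, not all items.
    hallucination_rate = hallucinated / len(answered) if answered else 0.0
    return abstention_rate, hallucination_rate

# Invented toy items, for illustration only.
toy_items = [
    (1.4, False), (0.6, True), (1.1, False), (0.9, True),
    (2.0, False), (1.05, True), (0.3, False), (1.8, False),
]
ungated = gate_report(toy_items, threshold=0.0)  # answer everything
gated = gate_report(toy_items, threshold=1.0)    # abstain on low scores
```

In this toy set, raising the threshold trades coverage for fewer hallucinated answers, the same kind of tradeoff as the 20.6-27.9% abstention cost quoted above.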

Potential risks and opportunities

Risks

  • Legal and compliance teams deploying RAG-based document review without ISR-style gating face unknown hallucination rates; the 0.0-0.7% figure was measured only under controlled conditions with evidence chunks capped at 48 tokens.
  • The 24.1% abstention rate required to reach the 0.7% hallucination ceiling may be operationally unacceptable in high-throughput enterprise pipelines, forcing tradeoffs between coverage and reliability.
  • Results were validated on FEVER, HotpotQA, NQ-Open, and PopQA only; organizations in domains outside these benchmarks cannot assume that the ISR gate results or the dispersion scaling transfer.

Opportunities

  • Enterprise legal tech and compliance vendors building RAG pipelines can integrate the ISR gate as a drop-in quality layer with a quantifiable 0.0-0.7% hallucination SLA on evidence-grounded decisions.
  • AI evaluation and auditing firms gain a new metric framework (QMV, EDFL, ISR) for offering formal hallucination-rate guarantees on evidence-grounded QA, differentiating from generic benchmark-based audits.
  • LLM API providers can differentiate on order-agnostic inference quality by surfacing ISR scores or permutation-averaged outputs as an optional reliability tier for document-grounded use cases.
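Permutation-averaged outputs, as mentioned in the last opportunity, could be produced by scoring the same evidence under several random chunk orderings and reporting the mean alongside the spread. The sketch below assumes a scalar score per query and uses the population standard deviation as the dispersion measure. Both are assumptions, as are the names `permutation_scores` and `toy_order_sensitive_score`; the latter is a stand-in for a real model call, which the summary does not describe.

```python
import random
import statistics

def permutation_scores(chunks, score_fn, num_perms=8, seed=0):
    """Score the same evidence chunks under num_perms random orderings."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_perms):
        order = rng.sample(chunks, len(chunks))  # one random ordering
        scores.append(score_fn(order))
    return scores

def toy_order_sensitive_score(order):
    # Hypothetical stand-in for a model: the score falls as the key chunk moves later.
    return 1.0 / (1 + order.index("key"))

def toy_order_invariant_score(order):
    # Hypothetical stand-in for an order-agnostic model.
    return float(len(order))

chunks = ["key", "a", "b", "c"]
sensitive_scores = permutation_scores(chunks, toy_order_sensitive_score)
mean_sensitive = statistics.fmean(sensitive_scores)
sensitive_dispersion = statistics.pstdev(sensitive_scores)
flat_dispersion = statistics.pstdev(permutation_scores(chunks, toy_order_invariant_score))
```

An order-agnostic scorer yields zero dispersion by construction, while an order-sensitive one yields a non-negative spread that a provider could surface next to the averaged answer.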

What we don't know yet

  • Whether the ISR gate generalizes beyond the 7-9B models tested (Qwen2-7B, Llama-3.1-8B, Gemma-2-9B) to frontier-scale models is unaddressed.
  • Production latency cost of the permutation sampling approach is unevaluated; Experiment 1 alone required 97,888 forward passes across two models.
  • Whether the O(log n) dispersion law holds when evidence chunks exceed the 48-token cap used throughout the study is an open empirical question.
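The latency question can at least be framed with back-of-the-envelope arithmetic: permutation sampling multiplies forward passes by the number of orderings per query. The helper below is a rough sketch. The query volume, permutations per query, and per-pass latency are hypothetical inputs, and the even split of the 97,888 passes across the two models is an assumption the summary does not confirm.

```python
def permutation_overhead(queries, perms_per_query, seconds_per_pass):
    """Return (total forward passes, total sequential seconds)."""
    passes = queries * perms_per_query
    return passes, passes * seconds_per_pass

# Hypothetical workload, for illustration only.
passes, seconds = permutation_overhead(
    queries=1000, perms_per_query=6, seconds_per_pass=0.25
)

# If Experiment 1's 97,888 passes were split evenly across two models (assumption):
passes_per_model = 97_888 // 2
```

Batching the orderings or running them in parallel would reduce wall-clock time but not the total compute.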