EDV Framework Curbs AI Agent Memory Contamination with Consensus
TL;DR
- EDV achieves 86.6% Pass@1 on τ²-bench, outperforming the best prior ensemble baseline of 83.5%.
- Injecting 10% erroneous memories into a single-agent baseline causes a 5.3 percentage point performance drop, the core vulnerability EDV targets.
- Single-agent baselines stagnate or decline across training epochs; EDV improves consistently from 0.810 Pass@1 at epoch 1 to 0.909 by epoch 4.
When an AI agent both executes a task and grades its own performance, you get a compounding problem. A flawed trajectory that is internally consistent gets filed into shared memory as a success, and every agent that later retrieves it inherits the mistake. Researchers from Zhejiang University, Tsinghua University, Northwestern Polytechnical University, and several other institutions call this the Self-Confirmation Trap, and a new paper on Hugging Face proposes a structural fix: EDV, an Execute-Distill-Verify framework that assigns execution, summarization, and validation to distinct roles rather than letting a single agent handle all three.
The design works as follows. In the Execute stage, multiple heterogeneous agents tackle the same task in parallel, generating diverse candidate trajectories. In the Distill stage, a designated third-party agent compares those trajectories and drafts candidate experiences. Because this agent is not one of the executors, it breaks the self-referential loop the paper targets. In the Verify stage, the execution group votes: unanimous approval writes an experience to shared memory, partial approval limits it to the approving agent's private memory, and rejection discards it entirely.
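The Verify-stage routing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Route`, `MemoryStore`, and `route_experience` are hypothetical, votes are modeled as simple booleans, and on partial approval the sketch assumes each approving agent receives its own private copy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Route(Enum):
    SHARED = "shared"    # unanimous approval
    PRIVATE = "private"  # partial approval
    DISCARD = "discard"  # no approval


@dataclass
class MemoryStore:
    # Hypothetical container: one shared pool plus per-agent private pools.
    shared: List[str] = field(default_factory=list)
    private: Dict[str, List[str]] = field(default_factory=dict)


def route_experience(experience: str, votes: Dict[str, bool], store: MemoryStore) -> Route:
    """Route a distilled experience according to the executors' votes."""
    if not votes:
        raise ValueError("at least one executor vote is required")
    approvers = [agent for agent, approved in votes.items() if approved]
    if len(approvers) == len(votes):
        # Every executor approved: write to shared memory.
        store.shared.append(experience)
        return Route.SHARED
    if approvers:
        # Assumption: each approving agent keeps the experience privately.
        for agent in approvers:
            store.private.setdefault(agent, []).append(experience)
        return Route.PRIVATE
    # Nobody approved: the experience is dropped.
    return Route.DISCARD


if __name__ == "__main__":
    store = MemoryStore()
    # Agent names below are placeholders, not the models used in the paper.
    print(route_experience("confirm identity before refund", {"a": True, "b": True, "c": True}, store))
    print(route_experience("skip eligibility check", {"a": True, "b": False, "c": False}, store))
    print(route_experience("always cancel the order", {"a": False, "b": False, "c": False}, store))
```

The point of the structure is that nothing reaches the shared pool on a single agent's say-so; a trajectory that only its own executor finds convincing stays private or is dropped.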
On τ²-bench, a benchmark covering customer-service scenarios across airline, retail, and telecom domains, EDV achieves 86.6% Pass@1 against 83.5% for the best prior ensemble baseline. The contamination sensitivity analysis sharpens the motivation: injecting just 10% erroneous memories into a single-agent baseline causes a 5.3 percentage point drop in the retail domain, roughly the gap EDV recovers in normal operation. Memory quality audits on a five-point scale show EDV-generated experiences score 4.41 on groundedness versus 3.72 for the baseline, and 0.63 on noise/hallucination versus 1.21 for the baseline, where a lower score means less noise. Training convergence adds another layer: the single-agent baseline stagnates or declines across epochs, while EDV's Pass@1 improves consistently from 0.810 at epoch 1 to 0.909 by epoch 4. Gains extend to Mind2Web web navigation and MMTB multi-mission tool tasks as well.
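For readers who want the margins behind these figures, the short calculation below recomputes them from the numbers quoted in this article only. The variable names are illustrative, and no data beyond the article's own figures is assumed.

```python
# Figures as quoted in the article.
edv_pass1 = 86.6            # EDV Pass@1 on τ²-bench, in percent
best_ensemble_pass1 = 83.5  # best prior ensemble baseline, in percent

epoch1_pass1 = 0.810        # EDV Pass@1 at epoch 1 (fraction)
epoch4_pass1 = 0.909        # EDV Pass@1 at epoch 4 (fraction)

edv_grounded, base_grounded = 4.41, 3.72  # groundedness scores
edv_noise, base_noise = 0.63, 1.21        # noise/hallucination scores

# Absolute margin over the ensemble baseline, in percentage points.
margin_pp = round(edv_pass1 - best_ensemble_pass1, 1)

# Gain across training epochs, converted to percentage points.
epoch_gain_pp = round((epoch4_pass1 - epoch1_pass1) * 100, 1)

# Differences on the memory-quality audit.
groundedness_gain = round(edv_grounded - base_grounded, 2)
noise_reduction = round(base_noise - edv_noise, 2)

print(margin_pp, epoch_gain_pp, groundedness_gain, noise_reduction)
```

This works out to a 3.1 point margin over the ensemble baseline, a 9.9 point gain from epoch 1 to epoch 4, a 0.69 point groundedness advantage, and a 0.58 point noise reduction. The 5.3 point contamination drop is measured on the retail domain with a single-agent baseline, so it is not directly comparable to the 3.1 point margin.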
The honest caveats: the model pool draws entirely from Chinese LLM providers (Mimo-V2-Flash, GLM-4.7-FP8, and MiniMax-M2.1), and whether the gains transfer to other model families is untested. The three benchmarks are structured, human-verifiable environments; open-ended real-world deployments may surface different failure modes. Running multiple agents in parallel adds inference overhead, though the paper reports a 24.5% reduction in token consumption versus the single-agent ReasoningBank baseline on the retail subset.
For teams building agentic systems with persistent memory, the core architectural lesson holds regardless of which models are used: the agent that runs the task should not be the sole authority on whether it succeeded. Code is available at github.com/shidingz/EDV.
Originally reported by huggingface.co
Original headline: EDV Framework Solves AI Agent 'Self-Confirmation Trap' — Multi-Agent Consensus Verification Cuts Memory Contamination 5.3pp, 86.6% Pass@1 on τ²-Bench