Hermes Skill Pits Three AI Models in Adversarial Debate
Key insights
- Three AI models independently answer and debate before RRF and Borda Count voting aggregate their outputs into a single consensus response.
- The system required 136 API calls and 16.8 million tokens per run, making compute cost a significant constraint for production adoption.
- The entire Hermes skill was constructed by AI agents, demonstrating AI systems building their own multi-model evaluation infrastructure.
Why this matters
Adversarial multi-model architectures represent a structural alternative to prompt engineering for improving answer reliability, shifting quality control from input design to output arbitration across competing models. The 136-call, 16.8M-token footprint per query signals that adversarial consensus carries real compute costs that teams must model before deploying it in production pipelines alongside latency-sensitive workloads. The agent-built construction is itself a measurable signal: AI systems are now generating the evaluation infrastructure used to validate other AI systems, compressing the human role in tooling development faster than most engineering organizations have planned for.
Summary
Three AI models independently answering a question before debating each other's responses is now a working Hermes skill. A developer published the build: models compete, argue adversarially, and their outputs are aggregated using Reciprocal Rank Fusion and Borda Count voting to surface a final ranked answer.
The system consumed 136 API calls and 16.8 million tokens per run. The build itself was constructed entirely by AI agents, making it a demonstration of AI generating its own multi-model evaluation infrastructure without direct human coding.
Essentially: one developer, using AI-built tooling, shipped a structured disagreement engine designed to produce more robust answers on contested factual and analytical questions.
- Three models independently answer, then debate each other's outputs before the ranking stage begins.
- RRF and Borda Count aggregate the competing ranked responses into a consensus final answer.
- Agent-built construction means no human wrote the skill code directly.
Single-model pipelines let confident wrong answers go unchallenged; this architecture treats disagreement as the mechanism rather than the noise to suppress.
Potential risks and opportunities
Risks
- Teams adopting adversarial consensus pipelines without benchmarking accuracy gains could burn API budgets on 16.8M-token runs that outpace marginal answer quality improvements.
- AI-built infrastructure lacking auditable construction history creates maintenance and debugging risk when underlying model APIs change versions or deprecate endpoints.
- Borda Count and RRF aggregation assume model outputs are meaningfully independent; if the three models share training data or RLHF pipelines, the adversarial benefit collapses silently without any visible failure signal.
Opportunities
- LLM inference providers (Together AI, Fireworks, Groq) can position low-latency batch inference as the cost layer that makes 136-call adversarial architectures economically viable at scale.
- Evaluation and observability platforms (Braintrust, LangSmith, Arize) have a clear product path adding RRF and Borda Count as native aggregation methods for multi-model pipeline outputs.
- Enterprise teams building high-stakes RAG systems in legal, medical, and financial verticals can adopt this pattern to add a structured disagreement layer before surfacing answers to end users.
What we don't know yet
- Whether accuracy gains on contested factual questions have been benchmarked against a single-model baseline using an equivalent token budget.
- Which three specific models were used, and whether using variants from the same provider family undermines the adversarial independence the design depends on.
- Latency per query under the 136-call architecture, which determines viability for any production use case requiring near-real-time responses.
Originally reported by reddit.com
Read the original article →Original headline: r/AI_Agents: Developer Builds Hermes Skill Where 3 AI Models Argue Adversarially Before Surfacing an Answer — RRF + Borda Count Ranking, 136 API Calls, Entirely Agent-Built