predictivedefense.io via Reddit

Predictive Defense: AI Analyst Agents Invent Evidence

agents hallucinations cybersecurity ai-agents intelligence

Key insights

  • Dimyanoglu's six-agent pipeline produced a final report with APT41 and Volt Typhoon content absent from collected intelligence.
  • Trust scores and verbatim-retrieval requirements functioned as model instructions, not enforceable constraints on output grounding.
  • LLM training data is dominated by vendor threat reports and marketing material that passes as authoritative intelligence.
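The gap between an instruction and an enforceable constraint can be made concrete with a check that runs outside the model. The following is a minimal sketch, not Dimyanoglu's implementation: it assumes the report marks verbatim material with double quotes, and the function names and sample strings are hypothetical. It returns any quoted span that does not occur in the collected sources.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unverified_quotes(report: str, sources: list[str]) -> list[str]:
    """Return every double-quoted span in the report that appears in no source verbatim."""
    corpus = [normalize(s) for s in sources]
    quotes = re.findall(r'"([^"]+)"', report)
    return [q for q in quotes if not any(normalize(q) in doc for doc in corpus)]


# Hypothetical data, invented for illustration only.
sources = ["Analysts observed phishing emails targeting a regional utility in March."]
report = (
    'The collection states "phishing emails targeting a regional utility" and '
    '"pre-positioning against critical infrastructure".'
)

missing = unverified_quotes(report, sources)
print(missing)  # ['pre-positioning against critical infrastructure']
```

A pipeline could refuse to pass a report downstream while this list is non-empty. The check only covers quoted spans; paraphrased claims would need a different test.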

Why this matters

Any agent system that depends on retrieval-based grounding faces the same structural risk Dimyanoglu documents: the model silently substitutes content from its training data for missing evidence without flagging the gap, so the failure is invisible in the final output. The failure is architectural rather than a prompt-engineering problem, meaning practitioners cannot fix it by refining instructions alone; whether a more capable model would help is a variable Dimyanoglu leaves untested. Because LLM training data is disproportionately composed of vendor threat reports and high-profile APT narratives, the fabricated content sounds authoritative and is difficult to distinguish from legitimately collected intelligence.

Summary

Robin Dimyanoglu at Predictive Defense built a six-agent pipeline to automate structured intelligence analysis and documented its failure at every layer. Tested on a single query about Chinese cyber operations against Taiwan, the system produced a final report that referenced APT41, PLA Strategic Support Force coordination, and Volt Typhoon pre-positioning against Taiwanese infrastructure. None of that content appeared in the collected intelligence. The model pulled it from training data and presented the fabricated scenarios with the same confidence it would give to real sources. In essence, the pipeline's trust scores and verbatim-retrieval rules were instructions to the model, not actual constraints on its output.

  • Intake generated heavily overlapping sub-questions that all required identical information, undermining the collection design from the start.
  • Collection tasks were too broad to execute meaningfully, and included instructions that failed to blind the collector.
  • The final report was structurally correct but analytically hollow, its coherent scenarios entirely ungrounded in the retrieved material.

Dimyanoglu concludes that the core problems are hard in a way that better prompts alone won't fix, and explicitly flags model selection (Gemini-Flash) and scaffolding as unresolved variables.
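The overlapping sub-questions described in the summary are the kind of defect a simple pre-collection check might surface. The sketch below is an assumption-laden illustration, not part of the documented pipeline: it compares sub-questions by Jaccard similarity over content words, and the stopword list, threshold, and sample questions are all invented here.

```python
from itertools import combinations

# Hypothetical stopword list; a real deployment would likely use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are",
             "what", "which", "how", "against", "by", "for"}


def content_tokens(question: str) -> set[str]:
    """Lowercase the question, strip basic punctuation, and drop stopwords."""
    tokens = {w.strip("?.,").lower() for w in question.split()}
    return tokens - STOPWORDS - {""}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(questions: list[str], threshold: float = 0.5):
    """Return (i, j, score) for each pair of questions at or above the threshold."""
    token_sets = [content_tokens(q) for q in questions]
    pairs = []
    for i, j in combinations(range(len(questions)), 2):
        score = jaccard(token_sets[i], token_sets[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Invented sub-questions, not taken from the article.
questions = [
    "What cyber operations target Taiwan infrastructure?",
    "Which cyber operations target infrastructure in Taiwan?",
    "How is collection tasking reviewed?",
]
print(overlapping_pairs(questions))  # [(0, 1, 1.0)]
```

Word overlap is a crude proxy: two questions can share vocabulary yet need different evidence, so flagged pairs would still need an analyst's judgment.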

Potential risks and opportunities

Risks

  • Intelligence teams using LLM agent pipelines for threat reporting could publish APT-attribution assessments built on model-fabricated content, driving policy or response decisions on hallucinated evidence.
  • Security vendors shipping AI-assisted analysis products face reputational and liability exposure if customers rely on reports where training-data content is presented as collected intelligence.
  • Systematic over-representation of Volt Typhoon and APT41 narratives in training data could bias AI-generated threat assessments toward well-documented actors, masking novel or less-publicized threat activity.

Opportunities

  • RAG evaluation and output-grounding vendors have a concrete failure case to anchor product pitches: structured pipelines with explicit source requirements still fabricate when grounding is enforced only as a prompt instruction.
  • Intelligence firms offering human-analyst review in hybrid workflows can use this case study to differentiate against fully automated competitors, positioning human gap-detection as the only reliable safeguard currently available.
  • Model providers that develop verifiable citation modes — flagging output content not traceable to the provided context window — directly address the structural gap Dimyanoglu identifies as unsolved by prompting alone.
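Flagging output that cannot be traced to the provided context could, in its simplest form, look like the sketch below. This is only a lexical approximation under stated assumptions, not any provider's citation mode: the function names, the length filter, the coverage threshold, and the sample text are hypothetical, and a sentence is flagged when too few of its longer words occur anywhere in the context.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens found in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def untraceable_sentences(report: str, context: str, min_coverage: float = 0.6):
    """Return (sentence, coverage) for sentences whose longer words are mostly absent from context."""
    context_words = words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        # Ignore short words, which are mostly function words.
        tokens = {w for w in words(sentence) if len(w) > 3}
        if not tokens:
            continue
        coverage = len(tokens & context_words) / len(tokens)
        if coverage < min_coverage:
            flagged.append((sentence, round(coverage, 2)))
    return flagged


# Invented context and report, for illustration only.
context = (
    "Phishing emails targeted a regional utility in March. "
    "The sender domain was registered one week earlier."
)
report = (
    "Phishing emails targeted a regional utility. "
    "The campaign is attributed to APT41 operators."
)

print(untraceable_sentences(report, context))
# [('The campaign is attributed to APT41 operators.', 0.0)]
```

Lexical coverage misses paraphrase and can be defeated by a sentence that reuses context vocabulary to make a new claim, so it indicates where to look rather than proving grounding.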

What we don't know yet

  • Whether the grounding failure reproduces across more capable models than Gemini-Flash, which Dimyanoglu explicitly flagged as an untested variable.
  • How to measure grounding compliance in multi-agent intelligence pipelines: no evaluation benchmark exists for this, leaving practitioners no systematic way to detect training-data substitution.
  • Whether human-in-the-loop review at the collection stage would catch gap-filling before it reaches the final analytical report, a question the article does not address.