arxiv.org web signal

Vera claims 93.9% attack success on LLM agent frameworks

TL;DR

  • The Vera framework reports average attack success rates of 93.9% against four production agent frameworks under multi-channel attacks.
  • Vera-Bench ships 1,600 executable safety cases spanning 124 risk categories, covering OpenClaw, Hermes, Codex, and Claude Code.
  • Verifiers judge outcomes using environment state and tool-call evidence rather than the agent's own self-report of what happened.

A new arxiv preprint called Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification puts a number on how brittle production agent frameworks still are: an average attack success rate of 93.9% under what the authors call multi-channel attacks. The systems tested include OpenClaw, Hermes, Codex, and Claude Code.

The framework the authors propose, Vera, runs a three-stage pipeline. Literature-driven exploration surfaces emerging risks and structures them into taxonomies. Combinatorial composition across those taxonomy dimensions builds executable safety cases with concrete goals and verification predicates. Adaptive execution then runs agents in isolated sandboxes, where a control agent steers multi-turn interaction based on runtime observations.

The interesting bit for practitioners is the verification step. Rather than asking the agent to self-report whether it did something unsafe, Vera judges outcomes using environment state and tool-call evidence rather than model self-report. That distinction matters because a lot of published safety numbers on agents lean on the model's own claim about what it did, which is exactly the wrong thing to trust when the model is under adversarial pressure.

The authors also released Vera-Bench, 1,600 executable safety cases spanning 124 risk categories. The honest caveat is that 93.9% is a single average across four frameworks, and the abstract doesn't break it down by system or by category, so take the specifics as reported, not settled. This is also an author-run evaluation of the authors' own attack methodology on other people's systems, which is exactly the setup where the numbers tend to look most impressive.

What the reporting doesn't give you is the per-framework leaderboard, the base rate for the safest configuration, or how much of the attack surface comes from tool wiring versus prompt-level exploits. For teams shipping agents, the practical move is probably to grab Vera-Bench and rerun it against your own harness, rather than reading the 93.9% as a fixed property of any named system.

Shared on Bluesky by 2 AI experts