the-decoder.com web signal

Claude Mythos nearly doubles GPT-5.5 score on real V8 exploits

anthropic openai cybersecurity ai-security autonomous-exploitation browser-exploits

Key insights

  • Claude Mythos Preview scored 9.90/16 on ExploitBench, achieving full code execution on 21 of 41 real V8 vulnerabilities.
  • Mythos outscored GPT-5.5 nearly 2-to-1 (9.90 vs. 5.51) and still scored 9.55 in fully autonomous mode, with no human guidance.
  • ExploitBench targets Chrome, Edge, Node.js, and Cloudflare Workers using real unpatched vulnerabilities, not synthetic benchmarks.

Why this matters

Frontier AI can now autonomously develop working browser exploits at a scale and speed that outpace most human security researchers, compressing the window between vulnerability discovery and working code execution. The benchmark's five-tier structure, culminating in arbitrary code execution against Chrome and Cloudflare Workers, shows that AI capability transfers to real production infrastructure rather than remaining confined to controlled lab puzzles. For security teams at browser vendors, cloud providers, and any organization running V8-dependent workloads, the threat model now includes AI-assisted zero-day development as a credible near-term attack vector, not a theoretical future concern.

Summary

Carnegie Mellon's ExploitBench tested AI agents against 41 real V8 vulnerabilities. Claude Mythos Preview scored 9.90/16 and achieved full code execution on 21 bugs; its score is nearly double GPT-5.5's 5.51. The benchmark spans five tiers, ending at arbitrary code execution against Chrome, Edge, Node.js, and Cloudflare Workers. Mythos held 9.55 in fully autonomous mode, solving bugs that human experts had previously abandoned. In short, Anthropic and OpenAI are producing models that function as credible autonomous offensive security researchers against production browser infrastructure.

  • Mythos cost ~$36,428 to test versus $3,075 for GPT-5.5: roughly 12x the cost for about 1.8x the score.
  • Researchers describe Mythos as operating 'like a fairly competent browser/JS engine security researcher.'
  • The benchmark uses real, unpatched vulnerabilities rather than synthetic CTF-style problems.
  • V8 underpins Chrome, Node.js, and Cloudflare Workers, so AI exploit capability in this space has direct production reach.
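The cost and score comparison in the summary can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this brief. The "cost per score point" metric is an illustrative assumption, not something the benchmark authors report, and it assumes each dollar figure covers one full benchmark run.

```python
# Figures quoted in this brief; no other data is assumed.
MAX_SCORE = 16
TOTAL_BUGS = 41

scores = {"Claude Mythos Preview": 9.90, "GPT-5.5": 5.51}
costs_usd = {"Claude Mythos Preview": 36_428, "GPT-5.5": 3_075}

MYTHOS, GPT = "Claude Mythos Preview", "GPT-5.5"

# How many times higher Mythos scores, and how many times more it costs.
score_ratio = scores[MYTHOS] / scores[GPT]
cost_ratio = costs_usd[MYTHOS] / costs_usd[GPT]

# Hypothetical metric: dollars spent per benchmark point earned.
cost_per_point = {name: costs_usd[name] / scores[name] for name in scores}

# Share of the 41 bugs on which Mythos reached full code execution.
full_exec_share = 21 / TOTAL_BUGS

# Fraction of the Mythos score retained in fully autonomous mode.
autonomous_retention = 9.55 / scores[MYTHOS]

if __name__ == "__main__":
    print(f"score ratio:          {score_ratio:.2f}x")
    print(f"cost ratio:           {cost_ratio:.2f}x")
    for name, value in cost_per_point.items():
        print(f"cost per point ({name}): ${value:,.2f}")
    print(f"full-execution share: {full_exec_share:.1%}")
    print(f"autonomous retention: {autonomous_retention:.1%}")
```

On these numbers the score ratio is about 1.80x and the cost ratio about 11.85x, so "nearly double" and "12x" are both rounded up slightly. Under the per-point assumption, each Mythos point costs roughly 6.6 times as much as a GPT-5.5 point.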

Potential risks and opportunities

Risks

  • Google Chrome and Microsoft Edge security teams face a compressed zero-day window as state-level actors and well-funded criminal groups gain access to Mythos-equivalent models, potentially outpacing V8 patch cycles within 12-18 months.
  • Cloudflare Workers and Node.js server operators face elevated exposure if autonomous V8 exploit pipelines reach non-state threat actors before mitigations ship for the 21 bugs Mythos successfully cracked.
  • Publishing detailed exploit benchmark methodology without coordinated disclosure risks providing a replication roadmap that lowers the skill floor for developing production-grade browser exploits.

Opportunities

  • Vendors of V8-based browsers (Google, Microsoft) have a near-term window to fund automated V8 fuzzing and patch-velocity programs before Mythos-class capability reaches commodity API pricing.
  • Offensive security firms (CrowdStrike, Rapid7, Synack) can productize AI-assisted browser exploit development for authorized red-teaming engagements, with ExploitBench scores serving as a credible third-party capability signal.
  • Cyber insurance underwriters pricing browser remote-code-execution exposure (Coalition, At-Bay) now have a concrete benchmark for modeling AI-assisted attack frequency against V8-dependent infrastructure, enabling more precise premium adjustments.

What we don't know yet

  • Whether Google, Microsoft, and Cloudflare were notified about the specific V8 vulnerabilities Mythos successfully exploited before the benchmark results were published.
  • The cost trajectory for autonomous Mythos-class exploit development: at $36,428 per benchmark run, accessibility is currently limited, but the research does not address when this reaches commodity pricing.
  • Whether the 20 bugs on which Mythos did not reach full code execution reveal systematic capability gaps or simply reflect per-bug API cost constraints that a well-funded actor could eliminate.