giovannigatti.github.io via Reddit

CVE-Bench: gpt-5.5 Fixes Just 50% of Real Python CVEs


Key insights

  • CVE-Bench's top agent, gpt-5.5, fixed only 50% of 20 real Python CVEs overall, rising to 60% under optimal advisory conditions.
  • The most dangerous failure mode produces test-passing patches that leave the original vulnerability intact with no visible error signal.
  • gpt-5.5 costs roughly 12 times more per run than gpt-5.4-mini for statistically equivalent outcomes within the OpenAI family.

Why this matters

Automated security patching is entering developer pipelines at scale, and CVE-Bench shows that the best available agent fails on half of real-world CVEs, so vulnerable code can ship with a passing test suite as false assurance. The test-passing-but-vulnerable failure mode is structurally invisible to standard CI pipelines: no alert fires when a broken fix is committed and merged. The roughly 12x cost premium for gpt-5.5 over gpt-5.4-mini buys no statistically detectable improvement in patch quality, so no point on the cost-performance curve currently yields trustworthy automated patching.

Summary

CVE-Bench, Giovanni Gatti Pinheiro's benchmark, tested five LLM agents on 20 real Python CVEs. The best model, gpt-5.5, fixed only 50% overall and 60% under optimal advisory conditions. Four failure modes recur across all models, the most dangerous being a patch that passes every regression test while leaving the original vulnerability intact with no error signal. In short:

  • gpt-5.5 and laguna-m.1 are the top agents tested, yet both fall well short of reliable patching.
  • OpenAI-vs-Laguna gaps are statistically significant; within-family gaps are noise.
  • gpt-5.5 costs roughly 12x more per run than gpt-5.4-mini for equivalent outcomes.
  • 16 of 20 CVEs were disclosed after March 2026, limiting training-data contamination.

The silent pass-but-fail mode is what makes LLM patching dangerous: no error reaches the security team.

Potential risks and opportunities

Risks

  • Development teams using gpt-5.5 in automated patching pipelines could unknowingly ship vulnerabilities with CVSS scores up to 9.8, with no visible sign of a problem reaching the developer or security team.
  • Security organizations that treat passing regression tests as proof of a complete patch will have no mechanism to catch the silent-failure mode CVE-Bench identifies as the most operationally dangerous.
  • Teams switching from gpt-5.5 to gpt-5.4-mini for cost savings inherit statistically equivalent patch quality, including the same silent-failure rate, with no quality signal differentiating the two models.

Opportunities

  • The silent-failure gap creates demand for patch verification tooling that reruns exploit scenarios independently after an LLM fix, a layer absent from every pipeline the CVE-Bench study describes.
  • gpt-5.4-mini's statistical parity with gpt-5.5 at roughly 12x lower cost makes it the default cost-efficient baseline for teams building automated patching pipelines on OpenAI-family models.
  • CVE-Bench's open methodology gives security evaluation vendors a replicable framework to extend coverage beyond 20 Python CVEs and offer continuous LLM patch-quality benchmarking as a service.
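
The verification layer described in the first opportunity can be sketched in a few lines. This is a minimal illustration and not CVE-Bench's harness: the PatchCheck record, its field names, and the placeholder CVE ids are all assumptions made for the example. The point it shows is that a merge gate needs two independent signals, the regression suite and a replay of the original exploit, because the silent-failure mode passes the first and fails the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchCheck:
    """Outcome of checking one candidate patch (hypothetical record)."""
    cve_id: str
    regression_tests_pass: bool  # what a standard CI run reports
    exploit_still_works: bool    # result of replaying the original exploit


def classify(check: PatchCheck) -> str:
    """Label a patch using both signals instead of trusting CI alone."""
    if not check.regression_tests_pass:
        return "broken"          # visible failure: CI already catches this
    if check.exploit_still_works:
        return "silent-failure"  # tests pass, vulnerability remains
    return "fixed"


def gate(checks: list[PatchCheck]) -> list[str]:
    """Return the ids of patches that are safe to merge."""
    return [c.cve_id for c in checks if classify(c) == "fixed"]


# Placeholder ids and outcomes, invented for illustration only.
checks = [
    PatchCheck("CVE-A", regression_tests_pass=False, exploit_still_works=True),
    PatchCheck("CVE-B", regression_tests_pass=True, exploit_still_works=True),
    PatchCheck("CVE-C", regression_tests_pass=True, exploit_still_works=False),
]
print([classify(c) for c in checks])
print(gate(checks))
```

A pipeline that gates on regression tests alone would merge both CVE-B and CVE-C here; only the exploit replay separates them.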

What we don't know yet

  • Whether the test-passing silent-failure mode can be detected by any existing static analysis or exploit-replay tooling without changes to the patching pipeline.
  • How agent performance scales to non-Python or multi-language CVEs, which are explicitly excluded from CVE-Bench's 20-CVE dataset.
  • What architectural or training differences explain the statistically significant cross-family gap between OpenAI and Laguna models confirmed by McNemar's test.
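
For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) form, which compares two models on the same tasks using only the tasks where they disagree. It is an illustration under stated assumptions: the source does not say which variant of the test the benchmark used, and the counts passed in at the bottom are invented, not CVE-Bench results.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks fixed by model A only; c: tasks fixed by model B only.
    Tasks both models fix, or both fail, carry no information here.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only.
print(mcnemar_exact(8, 0))  # one-sided disagreement on 8 tasks
print(mcnemar_exact(2, 1))  # near-even disagreement on 3 tasks
```

With only 20 paired tasks, a gap registers as significant only when the disagreements are both numerous and lopsided, which is consistent with within-family differences reading as noise.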