CVE-Bench: gpt-5.5 Fixes Just 50% of Real Python CVEs
Key insights
- CVE-Bench's top agent, gpt-5.5, fixed only 50% of 20 real Python CVEs overall, rising to 60% under optimal advisory conditions.
- The most dangerous failure mode produces test-passing patches that leave the original vulnerability intact with no visible error signal.
- gpt-5.5 costs roughly 12 times more per run than gpt-5.4-mini for statistically equivalent outcomes within the OpenAI family.
Why this matters
Automated security patching is entering developer pipelines at scale, and CVE-Bench shows the best available agent fails on half of real-world CVEs, so a vulnerability can ship behind a green test suite that appears to confirm the fix. The test-passing-but-vulnerable failure mode is structurally invisible to standard CI pipelines, so no alert fires when a broken fix is committed and merged. The roughly 12x cost premium for gpt-5.5 over gpt-5.4-mini buys no statistically significant improvement in patch quality, leaving teams with no cost-performance trade-off that yields trustworthy automated patching.
Summary
CVE-Bench, Giovanni Gatti Pinheiro's benchmark, tested five LLM agents on 20 real Python CVEs. The best model, gpt-5.5, fixed only 50% overall and 60% under optimal advisory conditions.
Four failure modes recur across all models, the most dangerous being a patch that passes every regression test while leaving the original vulnerability intact with no error signal.
In short, gpt-5.5 and laguna-m.1 are the top agents tested, yet both fall well short of reliable patching.
- The OpenAI-vs-Laguna gaps are statistically significant; the within-family gaps are indistinguishable from noise.
- gpt-5.5 costs roughly 12x more per run than gpt-5.4-mini for equivalent outcomes.
- 16 of 20 CVEs were disclosed after March 2026, limiting the risk of training-data contamination.
The silent pass-but-fail mode is what makes LLM patching dangerous: no error reaches the security team.
Potential risks and opportunities
Risks
- Development teams using gpt-5.5 in automated patching pipelines could unknowingly ship vulnerabilities with CVSS scores up to 9.8, with no visible sign of a problem reaching the developer or security team.
- Security organizations that treat passing regression tests as proof of a complete patch will have no mechanism to catch the silent-failure mode CVE-Bench identifies as the most operationally dangerous.
- Teams switching from gpt-5.5 to gpt-5.4-mini for cost savings inherit statistically equivalent patch quality, including the same silent-failure rate, with no quality signal differentiating the two models.
Opportunities
- The silent-failure gap creates demand for patch verification tooling that reruns exploit scenarios independently after an LLM fix, a layer absent from every pipeline the CVE-Bench study describes.
- gpt-5.4-mini's statistical parity with gpt-5.5 at roughly 12x lower cost makes it the default cost-efficient baseline for teams building automated patching pipelines on OpenAI-family models.
- CVE-Bench's open methodology gives security evaluation vendors a replicable framework to extend coverage beyond 20 Python CVEs and offer continuous LLM patch-quality benchmarking as a service.
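The exploit-replay layer mentioned in the first opportunity can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CVE-Bench's harness: the path-traversal bug, both candidate patches, the payload list, and the outcome labels are all hypothetical and chosen only to show how a patch can pass its regression test while the exploit still works.

```python
import posixpath
from typing import Callable

BASE = "/srv/files"  # hypothetical directory the service is allowed to read from

JoinFn = Callable[[str, str], str]


def naive_patch_join(base: str, name: str) -> str:
    """Hypothetical incomplete fix: strips "../" once, so "....//" collapses back to "../"."""
    cleaned = name.replace("../", "")
    return posixpath.normpath(posixpath.join(base, cleaned))


def strict_join(base: str, name: str) -> str:
    """Hypothetical complete fix: normalize first, then reject anything outside base."""
    full = posixpath.normpath(posixpath.join(base, name))
    if not full.startswith(base.rstrip("/") + "/"):
        raise ValueError("path escapes base directory")
    return full


def regression_tests_pass(join: JoinFn) -> bool:
    # Stand-in for the project's existing test suite: only checks normal behavior.
    return join(BASE, "report.txt") == BASE + "/report.txt"


def exploit_succeeds(join: JoinFn) -> bool:
    # Independent exploit replay: does any payload resolve outside BASE?
    for payload in ("../secret", "....//secret"):
        try:
            resolved = join(BASE, payload)
        except ValueError:
            continue  # the patch rejected this payload
        if not resolved.startswith(BASE + "/"):
            return True
    return False


def classify(join: JoinFn) -> str:
    # Illustrative labels only; they are not the benchmark's own taxonomy.
    tests_ok = regression_tests_pass(join)
    exploited = exploit_succeeds(join)
    if tests_ok and not exploited:
        return "fixed"
    if tests_ok and exploited:
        return "silent_failure"
    return "tests_fail"


if __name__ == "__main__":
    for candidate in (naive_patch_join, strict_join):
        print(candidate.__name__, "->", classify(candidate))
```

A gate that accepts a patch only when `classify` returns "fixed" would reject the naive patch even though its regression test is green, which is exactly the signal a test-only CI pipeline never produces.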
What we don't know yet
- Whether the test-passing silent-failure mode can be detected by any existing static analysis or exploit-replay tooling without changes to the patching pipeline.
- How agent performance scales across non-Python or multi-language CVEs, explicitly excluded from CVE-Bench's 20-CVE dataset.
- What architectural or training differences explain the statistically significant cross-family gap between OpenAI and Laguna models confirmed by McNemar's test.
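For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) form applied to two agents evaluated on the same set of CVEs. The test looks only at discordant pairs, that is, CVEs one agent fixed and the other did not. The outcome lists are invented for illustration and are not CVE-Bench results; whether the study used the exact or the chi-square variant is not stated in the source.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(a_fixed: Sequence[bool], b_fixed: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    if len(a_fixed) != len(b_fixed):
        raise ValueError("both agents must be scored on the same CVEs")
    # Discordant pairs: only A fixed the CVE, or only B fixed it.
    only_a = sum(1 for a, b in zip(a_fixed, b_fixed) if a and not b)
    only_b = sum(1 for a, b in zip(a_fixed, b_fixed) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the agents never disagree, so there is no evidence of a gap
    # Under the null hypothesis each discordant pair favors either agent with p = 0.5.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up outcomes on 20 CVEs (True = fixed); NOT data from the benchmark.
    agent_a = [True] * 10 + [False] * 10
    agent_b = [True] * 2 + [False] * 18
    print(f"p = {mcnemar_exact(agent_a, agent_b):.4f}")
```

With only 20 CVEs, a gap needs several discordant pairs pointing the same way before it reaches significance, which is consistent with small within-family differences registering as noise.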
Originally reported by giovannigatti.github.io
Original headline: r/LocalLLaMA: CVE-Bench Finds LLM Agents' Worst Security-Patch Outcome Is a Convincing Fix That Passes All Tests but Ships Vulnerability Intact