giovannigatti.github.io via Reddit June 1st 2026

CVE-Bench: gpt-5.5 Fixes Just 50% of Real Python CVEs

Key insights

CVE-Bench's top agent, gpt-5.5, fixed only 50% of 20 real Python CVEs overall, rising to 60% under optimal advisory conditions.
The most dangerous failure mode produces test-passing patches that leave the original vulnerability intact with no visible error signal.
gpt-5.5 costs roughly 12 times more per run than gpt-5.4-mini for statistically equivalent outcomes within the OpenAI family.

Why this matters

Automated security patching is entering developer pipelines at scale, and CVE-Bench shows that the best available agent fails on half of real-world CVEs. A vulnerability can therefore ship silently, with a passing test suite attached as false confirmation. The test-passing-but-vulnerable failure mode is structurally invisible to standard CI pipelines, so no alert fires when a broken fix is committed and merged. The roughly 12x cost premium for gpt-5.5 over gpt-5.4-mini buys no statistically detectable improvement in patch quality, so teams cannot buy trustworthy automated patching at either price point.

Summary

CVE-Bench, Giovanni Gatti Pinheiro's benchmark, tested five LLM agents on 20 real Python CVEs. The best model, gpt-5.5, fixed only 50% overall and 60% under optimal advisory conditions. Four failure modes recur across all models, the most dangerous being a patch that passes every regression test while leaving the original vulnerability intact with no error signal.

In short:

- gpt-5.5 and laguna-m.1 are the top agents tested, yet both fall well short of reliable patching.
- OpenAI-vs-Laguna gaps are statistically significant; within-family gaps were noise.
- gpt-5.5 costs roughly 12x more per run than gpt-5.4-mini for equivalent outcomes.
- 16 of 20 CVEs were disclosed after March 2026, limiting training-data contamination.

The silent pass-but-fail mode is what makes LLM patching dangerous: no error reaches the security team.
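
To make the significance claim concrete, here is a minimal sketch of an exact McNemar's test, the paired test this piece names for the cross-family comparison. It compares two agents on the same set of CVEs and looks only at the CVEs where they disagree. The function name mcnemar_exact and the outcome lists are illustrative assumptions, not the study's code or data.

```python
from math import comb


def mcnemar_exact(a_fixed, b_fixed):
    """Exact two-sided McNemar's test on paired pass/fail outcomes.

    a_fixed and b_fixed are equal-length lists of booleans: did agent A
    (or B) fix CVE i? Returns (only_a, only_b, p_value).
    """
    if len(a_fixed) != len(b_fixed):
        raise ValueError("outcomes must be paired, one entry per CVE")

    # Only discordant pairs carry information about a difference.
    only_a = sum(1 for a, b in zip(a_fixed, b_fixed) if a and not b)
    only_b = sum(1 for a, b in zip(a_fixed, b_fixed) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0

    # Under the null, each discordant pair favours A or B with p = 0.5,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return only_a, only_b, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical outcomes on 20 CVEs, NOT the benchmark's results.
    agent_a = [True] * 10 + [False] * 10
    agent_b = [True] * 2 + [False] * 18
    print(mcnemar_exact(agent_a, agent_b))
```

With only 20 CVEs, the test needs a lopsided split among the discordant pairs to reach significance, which is one way small within-family differences can end up indistinguishable from noise.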

Potential risks and opportunities

Risks

Development teams using gpt-5.5 in automated patching pipelines could unknowingly ship vulnerabilities with CVSS scores up to 9.8, with no visible sign of a problem reaching the developer or security team.
Security organizations that treat passing regression tests as proof of a complete patch will have no mechanism to catch the silent-failure mode CVE-Bench identifies as the most operationally dangerous.
Teams switching from gpt-5.5 to gpt-5.4-mini for cost savings inherit statistically equivalent patch quality, including the same silent-failure rate, with no quality signal differentiating the two models.

Opportunities

The silent-failure gap creates demand for patch verification tooling that reruns exploit scenarios independently after an LLM fix, a layer absent from every pipeline the CVE-Bench study describes.
gpt-5.4-mini's statistical parity with gpt-5.5 at roughly 12x lower cost makes it the default cost-efficient baseline for teams building automated patching pipelines on OpenAI-family models.
CVE-Bench's open methodology gives security evaluation vendors a replicable framework to extend coverage beyond 20 Python CVEs and offer continuous LLM patch-quality benchmarking as a service.
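
The verification layer described in the first opportunity could look roughly like the sketch below: a patch counts as fixed only if the regression tests pass and a replayed exploit no longer succeeds. Everything here is hypothetical, including the verdict labels, the verify_patch helper, and the toy patched_join function; none of it comes from CVE-Bench or from a real CVE.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PatchVerdict:
    tests_pass: bool
    exploit_blocked: bool

    @property
    def label(self) -> str:
        # Labels are illustrative, not the benchmark's taxonomy.
        if self.tests_pass and self.exploit_blocked:
            return "fixed"
        if self.tests_pass:
            return "silent-failure"  # green CI, vulnerability still present
        if self.exploit_blocked:
            return "regression"  # exploit blocked, but behaviour broken
        return "broken"


def verify_patch(
    run_regression_tests: Callable[[], bool],
    exploit_succeeds: Callable[[], bool],
) -> PatchVerdict:
    """Run both checks independently and combine them into one verdict."""
    return PatchVerdict(
        tests_pass=run_regression_tests(),
        exploit_blocked=not exploit_succeeds(),
    )


def patched_join(base: str, name: str) -> str:
    """Toy 'patched' function: rejects absolute paths but still allows '..'."""
    if name.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    return f"{base}/{name}"


def regression_tests_pass() -> bool:
    # The existing tests only cover the happy path.
    return patched_join("/srv", "a.txt") == "/srv/a.txt"


def exploit_succeeds() -> bool:
    # Replay a path traversal attempt against the patched code.
    try:
        return ".." in patched_join("/srv", "../etc/passwd")
    except ValueError:
        return False


if __name__ == "__main__":
    print(verify_patch(regression_tests_pass, exploit_succeeds).label)
```

The design point is that the exploit replay runs independently of the regression suite, so a green test run alone can never produce a "fixed" verdict.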

What we don't know yet

Whether the test-passing silent-failure mode can be detected by any existing static analysis or exploit-replay tooling without changes to the patching pipeline.
How agent performance scales across non-Python or multi-language CVEs, explicitly excluded from CVE-Bench's 20-CVE dataset.
What architectural or training differences explain the statistically significant cross-family gap between OpenAI and Laguna models confirmed by McNemar's test.

Originally reported by giovannigatti.github.io

Original headline: r/LocalLLaMA: CVE-Bench Finds LLM Agents' Worst Security-Patch Outcome Is a Convincing Fix That Passes All Tests but Ships Vulnerability Intact