reddit.com via Reddit May 21st 2026

Open-Source LLMs Still Crack Under Long Reasoning Jailbreaks

open source safety cybersecurity jailbreaks open-source safety

Key insights

All 10 tested open-source models remained vulnerable to multi-turn jailbreaks even after lightweight defenses were applied.
Attack success rates decreased as model capability increased, but no model achieved reliable resistance to long reasoning jailbreaks.
The 167-scenario benchmark covered both prompt-injection (94) and jailbreak (73) attack types across six major model families.

Why this matters

Enterprises deploying open-source models in agentic or multi-turn workflows face a documented, quantified attack surface that lightweight guardrails do not close, meaning current production deployments may be more exposed than security reviews have assumed. The inverse scaling relationship between capability and attack difficulty creates a false sense of safety as organizations upgrade to more powerful open-source models. For AI safety teams and red-teamers, this research establishes a concrete, reproducible benchmark that competing defense proposals will now need to beat to be taken seriously.

Summary

An ACM paper testing 10 open-source models across 167 attack scenarios finds that long reasoning-based jailbreaks remain a persistent, largely unsolved problem even when lightweight defenses are applied. Researchers put Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen, and Gemma variants through 94 prompt-injection scenarios and 73 jailbreak scenarios. The attack class that proved most durable works by gradually shifting context across multiple conversational turns rather than launching a single adversarial prompt, exploiting the way these models maintain and update reasoning state over extended exchanges. Essentially: (Meta, Microsoft, Mistral AI, DeepSeek, Alibaba, Google) are all shipping models that share the same structural weakness. - Lightweight defenses measurably reduced attack success rates but none eliminated vulnerability to multi-turn reasoning manipulation. - Attack difficulty scaled inversely with model capability, meaning stronger models were harder but not impossible to break. - The 167-scenario test suite is unusually broad, making this one of the more systematic evaluations of open-source model robustness published to date. The finding reframes the safety gap less as a content-filter problem and more as a fundamental challenge in how long-context reasoning models maintain alignment across extended, adversarially steered conversations.

Potential risks and opportunities

Risks

Organizations running DeepSeek-R1 or Llama 3.2 in customer-facing or agentic pipelines could face exploitation via gradual context-shifting attacks before effective defenses are standardized and deployed.
Model providers (Meta, Mistral AI, Alibaba/Qwen, Google/Gemma) may face regulatory scrutiny in the EU AI Act compliance cycle if this benchmark is adopted as an evaluation standard, given documented persistent vulnerabilities.
Security auditors and compliance teams relying on single-turn red-teaming to clear open-source deployments are systematically underestimating risk, potentially leaving enterprise customers exposed without knowing it.

Opportunities

Multi-turn adversarial defense vendors and research groups (Robust Intelligence, HiddenLayer, Lakera) gain a concrete benchmark to validate and market defenses against, with a clear competitive differentiation path.
Enterprises with dedicated AI red-team capabilities can offer customers a materially higher assurance tier by extending evaluations to multi-turn long-reasoning attack scenarios, creating a new service line.
Open-source model maintainers at Meta, Mistral, and Alibaba who invest in fine-tuning specifically against the 167-scenario suite could use demonstrated benchmark improvements as a trust and adoption signal in the enterprise sales cycle.

What we don't know yet

Whether the same multi-turn reasoning attack patterns transfer to closed frontier models (GPT-4o, Claude, Gemini) at comparable rates, which the paper did not test.
Specific ASR (Attack Success Rate) numbers per model family are not surfaced in the Reddit summary, limiting direct vendor accountability comparisons.
Whether the 167-scenario suite has been made publicly available for independent replication or is gated behind the ACM paywall as of May 2026.

Originally reported by reddit.com

Read the original article →

Original headline: r/LocalLLaMA: ACM Paper Finds Open-Source LLMs Remain Highly Vulnerable to Long Reasoning Jailbreaks Across 10 Models and 167 Attack Scenarios Even With Lightweight Defenses