reddit.com via Reddit

r/artificial: Claude Fable 5 Security Guardrails Bypassed With Fake Homework Assignment — Educational Framing Overrides Hard Blocks on Exploit Assistance

anthropic safety cybersecurity jailbreak ai-safety security

Summary

A developer reports that Claude Fable 5's new hard blocks on security-related queries can be bypassed by framing exploit requests as homework on a Metasploitable2 VM, a deliberately vulnerable training target. According to the post, the academic framing causes the safety classifiers to permit detailed vulnerability exploitation guidance. The post states that Fable 5 launched with tighter restrictions on cybersecurity content than prior Claude models, but that surface-level intent signals appear to override content-level analysis. Commenters in the thread corroborate the technique with variations that use fictional and research contexts, which suggests the guardrail relies on declared purpose rather than on inspection of the semantic content.