Mid-Network Entropy Probe Flags Jailbreaks Without Retraining

TL;DR

  • Jailbreak signal concentrates near 69% layer depth on Llama-3.1-8B, Qwen3-8B, and Gemma-7b, not at the model output.
  • Monotonicity entropy feature reached mean AUROC of 0.941 on Llama and Qwen with zero classifier training required.
  • Against adversarially-crafted benign JailbreakBench prompts, Llama mean AUROC collapsed from 0.941 to 0.348, revealing a key robustness gap.

Most jailbreak defenses focus on what goes into a model or what comes out. This paper, accepted at ECML PKDD 2026 and authored by Sofiia Nikolenko, Michele Papucci, Mina Rezaei, and Shireen Kudukkil Manchingal, focuses instead on what happens in between, and finds the clearest signal there rather than at the output head.

The core finding is that jailbreak prompts do not merely alter overall uncertainty levels but induce structured dynamics in how predictive entropy changes as tokens unfold. Using the logit lens technique, the team projected intermediate hidden states into vocabulary space on frozen models without any retraining, measuring three rank-based entropy trajectory features across eight evenly-spaced probe layers per model. Static aggregate statistics like mean and variance carried little discriminative signal; features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, were substantially more informative. The signal concentrates in intermediate layers, around the 69% depth mark, where most detectors do not look, and degrades at the final layer.
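The entropy-trajectory idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the logit lens step has already produced one vector of vocabulary logits per token position at a given probe layer, and it assumes the monotonicity feature is a Kendall τ correlation between per-token entropy and token position. The function names `softmax_entropy`, `kendall_tau`, and `entropy_trend` are hypothetical.

```python
import math
from typing import Sequence


def softmax_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kendall_tau(values: Sequence[float]) -> float:
    """Kendall tau-a between a sequence and its positions 0..n-1.

    Returns +1.0 for a strictly increasing sequence and -1.0 for a
    strictly decreasing one. Tied pairs count as neither.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if values[j] > values[i]:
                concordant += 1
            elif values[j] < values[i]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def entropy_trend(per_token_logits: Sequence[Sequence[float]]) -> float:
    """Assumed monotonicity feature: rank trend of entropy over token positions."""
    entropies = [softmax_entropy(row) for row in per_token_logits]
    return kendall_tau(entropies)


# Toy input (made up): the distribution sharpens at every token position,
# so entropy falls monotonically along the sequence.
toy_logits = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
print(entropy_trend(toy_logits))
```

A score near -1 means entropy falls steadily across the prompt and a score near +1 means it rises steadily. Which direction marks a jailbreak is not stated in this summary, so the sketch makes no claim about it.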

The numbers hold across three distinct architectures. On Llama-3.1-8B, the monotonicity feature achieved a mean AUROC of 0.941, peaking at 0.999. On Qwen3-8B, the same feature produced a mean AUROC of 0.941 and a peak of 1.000. Gemma-7b reached a mean AUROC of 0.796 under the Kendall τ metric. Crucially, dynamic trend features consistently separated jailbreak from benign prompts across all three architectures without any classifier training, with cross-model standard deviation in Kendall τ of just 0.021, suggesting the result reflects a general property of jailbreaks rather than architecture-specific quirks.
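Evaluating without classifier training means the raw feature value itself serves as the detection score, and AUROC can be computed from those scores directly. The sketch below shows one way to do that, using the pairwise definition of AUROC: the probability that a randomly chosen positive outranks a randomly chosen negative. The helper name `auroc` is hypothetical, and the scores in the example are made up for illustration, not results from the paper.

```python
from typing import Sequence


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. No threshold and no fitted model are involved.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
print(auroc([0.9, 0.3], [0.4, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of prompts, which is fine for benchmark-sized sets; a rank-based computation would be the usual choice at larger scale.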

The honest caveat is substantial. When the benign prompt set was replaced with adversarially-crafted safe prompts from JailbreakBench, performance collapsed: Llama mean AUROC dropped to 0.348, Qwen to 0.347, and Gemma to 0.436. That is not a minor dip but a breakdown: an AUROC below 0.5 means the feature ranks these prompts worse than random guessing would, so the separation measured against ordinary benign prompts does not carry over to adversarially-crafted benign ones. The paper also does not address whether the approach extends to closed-weight models where layer activations are not accessible to external probes, or whether an attacker with knowledge of the entropy signature could design jailbreaks to evade it specifically.

For safety engineers working with open-weight deployments, the practical upside is real: a zero-training probe attached to an already-deployed frozen model is a cheaper first-pass filter than most alternatives. The robustness gap against adversarially-crafted benign inputs is the part that needs solving before this moves from benchmark to production.