reddit.com via Reddit

Gemma 4 Abliteration Variants Ranked on Capability

Key insights

  • coder3101's variant led all 13 abliterated Gemma 4 E2B entries on capability retention across eight benchmarks.
  • Some abliteration methods destroyed model capability alongside safety filters, creating a wide performance gap among tested variants.
  • This is the first systematic head-to-head comparison of competing abliteration approaches for a current Google open-weight model.

Why this matters

The benchmark gives local AI developers the first rigorous, multi-axis comparison of refusal-removal techniques for a current Google model, replacing guesswork with reproducible data. KL divergence as a selection criterion separates abliteration methods that subtly damage model behavior from those that only affect safety filters, which is critical for any production use case requiring consistent outputs. As open-weight models become standard components in commercial and research pipelines, community-generated quality gates like this will increasingly determine which variants get adopted at scale.
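To make the KL divergence criterion concrete, here is a minimal sketch of how drift from the base model could be measured on next-token distributions. It is not the benchmark author's code: the function names, the toy logits, and the choice of direction KL(base || variant) are all assumptions made for illustration. A value near zero means the variant predicts almost the same tokens as the base model on those prompts; larger values mean the modification changed behavior beyond refusals.

```python
import math


def softmax(logits):
    """Turn a list of raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(base_logits, variant_logits):
    """Average KL(base || variant) over paired next-token logit vectors."""
    pairs = list(zip(base_logits, variant_logits))
    return sum(kl_divergence(softmax(b), softmax(v)) for b, v in pairs) / len(pairs)


# Toy logits over a 4-token vocabulary for two prompts (illustrative values only,
# not measurements from the benchmark).
base = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 0.5, 0.5]]
identical = [row[:] for row in base]
shifted = [[1.0, 2.0, 0.1, -1.0], [0.5, 0.5, 2.5, 0.5]]

print(mean_kl(base, identical))  # an unchanged model shows no drift
print(mean_kl(base, shifted))    # a changed model shows positive drift
```

In practice the logits would come from running the base and abliterated models on the same set of harmless prompts, so that any divergence reflects collateral change and not the intended removal of refusals.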

Summary

Thirteen abliterated variants of Google's Gemma 4 E2B have been ranked head-to-head for the first time, following a 44-GPU-hour community benchmark run on a single RTX 5090. The researcher tested each variant against HarmBench safety metrics, KL divergence from the base model, and eight capability benchmarks to identify which refusal-removal techniques preserve model performance. coder3101's variant ranked first on capability retention, separating itself from methods that strip safety filters at the cost of underlying model quality. In short, the r/LocalLLaMA community's abliteration ecosystem around Google's Gemma 4 now produces the systematic benchmarking that official channels won't.

  • coder3101's variant led all 13 entries on capability retention across eight benchmark tasks
  • KL divergence from the base model measured how much each technique altered learned distributions beyond safety filters
  • Several variants degraded capability alongside safety filters, revealing a wide quality spread among competing approaches

Systematic community benchmarks like this are becoming the de facto quality gate for modified open-weight models in the absence of any official comparison infrastructure.

Potential risks and opportunities

Risks

  • Google could tighten Gemma 4 licensing terms to restrict distribution of abliterated weights, stranding developers who have already built pipelines on coder3101's variant
  • Researchers and companies deploying abliterated variants without independent validation risk capability regressions on tasks not covered by the eight benchmarks in this study
  • HarmBench scores may understate how readily these variants comply with harmful requests in adversarial production settings, creating liability exposure for enterprises using them in customer-facing applications

Opportunities

  • Model evaluation tooling maintainers (EleutherAI LM Evaluation Harness, Hugging Face lighteval) gain adoption momentum as the community converges on reproducible abliteration benchmarks
  • Developers building on abliterated models can adopt coder3101's variant as a validated community baseline, reducing internal benchmarking overhead before deployment
  • Safety researchers can apply the KL divergence methodology from this benchmark to design abliteration-resistant fine-tuning approaches for future open-weight model releases

What we don't know yet

  • Which of the eight capability benchmarks showed the most variance across variants, and whether any single benchmark reliably predicts overall capability retention
  • Whether coder3101's variant maintains its capability lead at longer context lengths or on domain-specific tasks outside the original benchmark set
  • Whether the full 44-GPU-hour evaluation suite is reproducible on hardware below RTX 5090 tier, which would determine who can independently verify or extend the results