reddit.com via Reddit May 31st 2026

Gemma 4 Abliteration Variants Ranked on Capability


Key insights

coder3101's variant led all 13 abliterated Gemma 4 E2B entries on capability retention across eight benchmarks.
Some abliteration methods degraded model capability while removing safety filters, producing a wide performance gap among the tested variants.
This is the first systematic head-to-head comparison of competing abliteration approaches for a current Google open-weight model.

Why this matters

The benchmark gives local AI developers the first rigorous, multi-axis comparison of refusal-removal techniques for a current Google model, replacing guesswork with reproducible data. Using KL divergence from the base model as a selection criterion helps separate abliteration methods that subtly damage general model behavior from those that change only refusal behavior, which matters for any production use case requiring consistent outputs. As open-weight models become standard components in commercial and research pipelines, community-generated quality gates like this will increasingly determine which variants get adopted at scale.
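To make the KL divergence criterion concrete, here is a minimal sketch of how drift from a base model could be measured. The source does not say which prompts, KL direction, or tooling the benchmark used, so everything here is an assumption: the function names (`softmax`, `kl_divergence`, `mean_kl`) are hypothetical, the logits are toy values rather than real model outputs, and the direction KL(base || variant) is chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mean_kl(base_logits, variant_logits):
    """Average KL(base || variant) over a list of token positions."""
    pairs = list(zip(base_logits, variant_logits))
    total = sum(kl_divergence(softmax(b), softmax(v)) for b, v in pairs)
    return total / len(pairs)

# Toy next-token logits for two positions over a 3-token vocabulary (made up).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
identical = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
drifted = [[0.1, 1.0, 2.0], [3.0, 0.5, 0.5]]

kl_same = mean_kl(base, identical)   # no drift from the base model
kl_drift = mean_kl(base, drifted)    # distributions have shifted
```

A variant whose outputs match the base model on harmless prompts would score near zero here, while one that changed behavior more broadly would score higher.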

Summary

Thirteen abliterated variants of Google's Gemma 4 E2B have been ranked head-to-head for the first time, following a 44-GPU-hour community benchmark run on a single RTX 5090. The benchmark's author scored each variant on HarmBench safety metrics, KL divergence from the base model, and eight capability benchmarks to identify which refusal-removal techniques preserve model performance. coder3101's variant ranked first on capability retention, separating itself from methods that strip safety filters at the cost of underlying model quality.

In short, the open-weight abliteration community around Google's Gemma 4, centered on r/LocalLLaMA, now produces systematic benchmarking that official channels do not provide.

- coder3101's variant led all 13 entries on capability retention across eight benchmark tasks
- KL divergence from the base model measured how much each technique altered learned distributions beyond safety filters
- Several variants degraded capability alongside safety filters, revealing a wide quality spread among competing approaches

Systematic community benchmarks like this are becoming the de facto quality gate for modified open-weight models in the absence of any official comparison infrastructure.
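As an illustration of what ranking on capability retention could look like, the sketch below scores each variant by the mean ratio of its benchmark score to the base model's score. The source does not describe how the eight benchmark results were aggregated, so this formula is an assumption. The variant names, benchmark names, and scores are placeholders, not results from the benchmark, and the helpers `retention` and `rank_variants` are hypothetical.

```python
from statistics import mean

def retention(base_scores, variant_scores):
    """Mean per-benchmark ratio of variant score to base score.

    Assumes every base score is nonzero and both dicts share the same keys.
    """
    return mean(variant_scores[name] / base_scores[name] for name in base_scores)

def rank_variants(base_scores, variants):
    """Return (variant_name, retention) pairs, highest retention first."""
    scored = [(name, retention(base_scores, scores)) for name, scores in variants.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Placeholder data: two made-up benchmarks and two made-up variants.
base = {"bench_1": 0.50, "bench_2": 0.40}
variants = {
    "variant_a": {"bench_1": 0.50, "bench_2": 0.38},  # close to the base model
    "variant_b": {"bench_1": 0.30, "bench_2": 0.20},  # heavily degraded
}

ranking = rank_variants(base, variants)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A ratio-based score keeps benchmarks with different scales comparable, but it weights all benchmarks equally, which may not match how the original ranking was produced.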

Potential risks and opportunities

Risks

Google could tighten Gemma 4 licensing terms to restrict distribution of abliterated weights, stranding developers who have already built pipelines on coder3101's variant
Researchers and companies deploying abliterated variants without independent validation risk capability regressions on tasks not covered by the eight benchmarks in this study
HarmBench scores may understate how often these variants comply with harmful requests in adversarial production settings, creating liability exposure for enterprises using them in customer-facing applications

Opportunities

Model evaluation tooling maintainers (EleutherAI LM Evaluation Harness, Hugging Face lighteval) gain adoption momentum as the community converges on reproducible abliteration benchmarks
Developers building on abliterated models can adopt coder3101's variant as a validated community baseline, reducing internal benchmarking overhead before deployment
Safety researchers can apply the KL divergence methodology from this benchmark to design abliteration-resistant fine-tuning approaches for future open-weight model releases

What we don't know yet

Which of the eight capability benchmarks showed the most variance across variants, and whether any single benchmark reliably predicts overall capability retention
Whether coder3101's variant maintains its capability lead at longer context lengths or on domain-specific tasks outside the original benchmark set
Whether the full 44-GPU-hour evaluation suite can be reproduced on hardware below the RTX 5090 tier; if not, fewer people will be able to independently verify or extend the results

Originally reported by reddit.com


Original headline: r/LocalLLaMA: Developer Benchmarks 13 Abliterated Gemma 4 E2B Variants Across Safety, KL Divergence, and 8 Capability Tasks — 44 GPU Hours on RTX 5090