unsloth.ai web signal

DiffusionGemma 26B Runs at 2,000+ Tokens/Second on 18 GB RAM

TL;DR

  • DiffusionGemma 26B-A4B generates text in parallel using diffusion refinement rather than token-by-token decoding.
  • The 4-bit quantized model requires only 18 GB RAM and reaches 2,000+ tokens per second on an RTX 6000.
  • Speed comes at a cost: AIME 2026 accuracy drops from 88.3% on standard Gemma 4 to 69.1% on DiffusionGemma.

Token-by-token generation has been the default for large language models throughout the transformer era, but Unsloth's documentation for DiffusionGemma shows what happens when you borrow the image diffusion paradigm instead. DiffusionGemma 26B-A4B, described as Google DeepMind's multimodal open model built on the Gemma 4 MoE architecture, drafts its output in parallel and gradually refines it into a final answer: the same conceptual loop image diffusion models use, applied to text. According to the documentation, it reaches 2,000+ tokens per second on an RTX 6000 GPU.
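To make the "parallel draft, then refine" loop concrete, here is a toy sketch. It is not DiffusionGemma's actual sampler, which the documentation summarized here does not spell out. It assumes a confidence-ranked unmasking schedule, and `refine_in_parallel`, `toy_denoiser`, and the hard-coded confidences are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been committed yet

# A denoiser proposes (token, confidence) for every position in one pass.
Proposal = Tuple[str, float]
Denoiser = Callable[[List[Optional[str]]], List[Proposal]]


def refine_in_parallel(denoiser: Denoiser, length: int, steps: int) -> List[Optional[str]]:
    """Fill `length` positions in at most `steps` passes.

    Each pass asks the denoiser for proposals at all positions, then commits
    only the most confident proposals among positions that are still masked.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    seq: List[Optional[str]] = [MASK] * length
    per_step = -(-length // steps)  # ceiling division: positions committed per pass
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        proposals = denoiser(seq)
        # Rank the still-masked positions by confidence, highest first.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = proposals[i][0]
    return seq


def toy_denoiser(seq: List[Optional[str]]) -> List[Proposal]:
    """Stand-in for a model: fixed tokens and made-up confidences."""
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    confidences = [0.9, 0.4, 0.8, 0.3, 0.7, 0.6]
    return list(zip(tokens, confidences))


# Six tokens are produced in three passes instead of six sequential steps.
print(refine_in_parallel(toy_denoiser, length=6, steps=3))
```

The point of the sketch is the shape of the loop: the number of model passes is set by `steps`, not by the output length, which is where a throughput advantage over one-token-per-pass decoding would come from.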

The accessibility figure that stands out is 18 GB of RAM for the 4-bit quantized version. That requirement fits on a single consumer or prosumer GPU. The model also handles a 256K token context window and supports over 140 languages, with multimodal inputs across text, images, and video, which is a broad capability profile for something runnable on a local workstation.
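A back-of-the-envelope check shows why 18 GB is plausible. The sketch below assumes that "26B" means roughly 26 billion total parameters, that every weight is stored at exactly 4 bits, and that 1 GB is 10^9 bytes. Real 4-bit formats usually keep some tensors at higher precision and add per-block scale factors, so treat the result as a lower bound rather than the documented footprint. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9      # assumption: "26B" read as 26 billion total parameters
STATED_BUDGET_GB = 18    # figure quoted for the 4-bit model

weights_4bit = weight_memory_gb(TOTAL_PARAMS, 4)
weights_16bit = weight_memory_gb(TOTAL_PARAMS, 16)
headroom = STATED_BUDGET_GB - weights_4bit

print(f"4-bit weights:  {weights_4bit:.1f} GB")
print(f"16-bit weights: {weights_16bit:.1f} GB")
print(f"Left for cache, activations, and overhead: {headroom:.1f} GB")
```

Under these assumptions the weights take about 13 GB, leaving roughly 5 GB of the 18 GB for everything else. In a mixture-of-experts model all experts have to be resident even though only a subset is active per token, so it is the total parameter count, not the active one, that drives this number.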

The honest caveat is the benchmark table. DiffusionGemma trades reasoning performance for speed versus standard Gemma 4: MMLU Pro comes in at 77.6% against Gemma 4's 82.6%, LiveCodeBench at 69.1% against 77.1%, and AIME 2026 at 69.1% against 88.3%. That last gap is substantial for any task involving hard mathematical reasoning. Local deployment also requires a specific llama.cpp pull request (#24423) that is still in progress rather than a stable release, so self-hosters are depending on code that may still change.
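The size of each gap is easier to compare when expressed both in percentage points and as a share of the baseline score. The snippet below uses only the figures quoted above; `gap` is a hypothetical helper.

```python
# (Gemma 4 score, DiffusionGemma score), as quoted in the text above.
scores = {
    "MMLU Pro": (82.6, 77.6),
    "LiveCodeBench": (77.1, 69.1),
    "AIME 2026": (88.3, 69.1),
}


def gap(baseline: float, diffusion: float) -> tuple:
    """Return (drop in percentage points, drop relative to baseline in %)."""
    points = round(baseline - diffusion, 1)
    relative = round(100 * (baseline - diffusion) / baseline, 1)
    return points, relative


for name, (baseline, diffusion) in scores.items():
    points, relative = gap(baseline, diffusion)
    print(f"{name}: -{points} points ({relative}% relative drop)")
```

This works out to drops of 5.0, 8.0, and 19.2 points, or about 6.1%, 10.4%, and 21.7% of the baseline respectively, which makes the AIME 2026 result the clear outlier.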

What the documentation does not address is how parallel refinement handles tasks where each output logically depends on the previous step, that is, the sequential chain-of-thought that autoregressive models handle naturally. That open question matters given that agentic workflows and code generation are listed as primary use cases.

For teams building latency-sensitive products (live coding assistants, high-volume document processing, low-latency chatbots), the throughput case is real. Whether the reasoning tradeoff is acceptable depends entirely on the task, and the benchmark gap is concrete enough that it should be the first thing any team measures against its actual workload before switching.
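Measuring against a real workload can start with something as small as the harness below. It is a minimal sketch: `generate` stands for whatever callable wraps your model and returns the generated tokens, and both it and `tokens_per_second` are hypothetical names, not part of Unsloth or llama.cpp. The clock is injectable so the function can be checked without a model.

```python
import time
from typing import Callable, Iterable, List


def tokens_per_second(
    generate: Callable[[str], List[str]],
    prompts: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run `generate` over every prompt and return generated tokens per second."""
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += len(generate(prompt))
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; use more prompts or a finer clock")
    return total_tokens / elapsed


if __name__ == "__main__":
    # Placeholder generator: swap in a call to your own model or server.
    def fake_generate(prompt: str) -> List[str]:
        return prompt.split()

    rate = tokens_per_second(fake_generate, ["measure on your own prompts"] * 1000)
    print(f"{rate:.0f} tokens/sec (placeholder generator, not a real model)")
```

Running the same prompts through both the standard and the diffusion model, and scoring the outputs with a task-specific check alongside the throughput figure, gives a direct view of the speed and quality tradeoff for that workload.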

Shared on Bluesky by 2 AI experts

  • Rafael Pinto @rcpinto.bsky.social amplified

    @unsloth.ai

    DiffusionGemma can now run at 2000+ tokens/sec! ⚡ We made local DiffusionGemma inference 1.8× faster. Run it on 18GB RAM via Unsloth Studio. GitHub: github.com/unslothai/un... Guide: unsloth.ai/docs/models/...

  • Abdoulaye Diack @diack.bsky.social amplified

    @unsloth.ai

    Google releases DiffusionGemma.✨ The new 26B-A4B diffusion text model runs locally on 18GB RAM. It supports high-speed text generation, thinking, image, video and 256K context. Run and train via Unsloth Studio. GGUF: …
