DiffusionGemma-26B rivals AR sibling, decodes 3.5-4.4x faster
TL;DR
- DiffusionGemma-26B matches or exceeds its same-size autoregressive sibling Gemma-4-26B on every medical VQA dataset the authors tested.
- Decoding is reported at 3.5-4.4x faster than the AR baseline. The LoRA fine-tuned MoE model has 3.8B active parameters.
- Bidirectional denoising gives the model any-order infill, so a radiologist can pin report fragments in place and have the model fill in the text between them.
A new arXiv preprint takes a diffusion language model and puts it head-to-head with an autoregressive model of the same size on medical visual question answering. On the two axes clinicians care about, quality and latency, the diffusion side comes out ahead in the authors' setup.
The setup is what makes the comparison interesting. The authors fine-tune DiffusionGemma-26B, described as a mixture-of-experts diffusion language model, with a LoRA recipe, and benchmark it against Gemma-4-26B, its same-size autoregressive sibling. Scoring is done by what the paper calls a verbosity-robust LLM judge. The abstract states that diffusion 'matches or exceeds AR on all' of the medical VQA datasets, that the fine-tuned model has 3.8B active parameters and is 'competitive with frontier vision-language models,' and that decoding is 3.5 to 4.4 times faster than the AR baseline.
The angle worth taking seriously is the drafting workflow, not the raw speedup. Autoregressive models emit tokens left to right, so holding both edges of a paragraph fixed and regenerating the middle is not something plain left-to-right decoding does naturally. Diffusion models can, because they denoise the whole canvas bidirectionally. The paper calls this 'any-order infill' and pitches it directly at radiologists: pin a couple of report fragments, then let the model rewrite the text between them. The authors say this operation is inherent to diffusion and that autoregression is 'subpar' at it.
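To make the infill idea concrete, here is a minimal toy sketch of masked-canvas decoding with pinned tokens. It is not the paper's method or code: `toy_denoiser`, `infill`, the `<mask>` token, the confidence-ranked commit schedule, and the sample vocabulary are all hypothetical, and the "denoiser" proposes random tokens where a real diffusion language model would condition on the whole canvas. The only point it illustrates is that pinned fragments are never overwritten while the gaps between them are filled in any order.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for an unfilled slot


def toy_denoiser(canvas, vocab, rng):
    """Stand-in for a diffusion LM (assumption, not a real model).

    Returns {position: (token, confidence)} for every masked slot.
    A real denoiser would look at tokens on both sides of each slot.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def infill(canvas, vocab, steps=4, seed=0):
    """Fill masked slots over a few steps, never touching pinned tokens."""
    rng = random.Random(seed)
    canvas = list(canvas)
    # Every token the user supplied is pinned for the whole run.
    pinned = {i for i, tok in enumerate(canvas) if tok != MASK}
    for step in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, vocab, rng)
        # Commit the most confident slots first, in any position order.
        # Ceiling division guarantees the last step fills what is left.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            canvas[i] = proposals[i][0]
    return canvas, pinned


# Two pinned fragments with a three-token gap between them.
draft = ["Lungs", "are", MASK, MASK, MASK, "No", "effusion", "."]
vocab = ["clear", "bilaterally", "well", "expanded", "."]
filled, pinned = infill(draft, vocab)
print(" ".join(filled))
```

Because the stand-in denoiser is random, the filled text is meaningless. What carries over to a real system is the contract: the positions in `pinned` come back unchanged, and the order in which the gap is filled depends on confidence, not on left-to-right position.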
The honest caveat is that this is a preprint whose quality claims rest on LLM-judged medical VQA rather than on radiologist grading, and the 3.5-4.4x figure is decoding speed rather than end-to-end round-trip time in a real reporting system. What the reporting does not give you is a clinician-in-the-loop evaluation of the infill workflow, or a spelled-out list of which 'frontier vision-language models' the fine-tuned model is being called competitive with.
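The gap between decoding speed and round-trip time can be sketched with Amdahl's law: if decoding is only part of the total latency, speeding it up helps only that part. The decode shares below are hypothetical values chosen for illustration, not measurements from the paper, and `end_to_end_speedup` is a name introduced here.

```python
def end_to_end_speedup(decode_fraction, decode_speedup):
    """Amdahl's law: overall speedup when only the decode share gets faster.

    decode_fraction: share of total latency spent decoding (0.0 to 1.0).
    decode_speedup:  factor by which decoding alone is accelerated.
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    return 1.0 / ((1.0 - decode_fraction) + decode_fraction / decode_speedup)


# Hypothetical decode shares; 3.5 and 4.4 are the reported decode speedups.
for fraction in (0.5, 0.8, 1.0):
    low = end_to_end_speedup(fraction, 3.5)
    high = end_to_end_speedup(fraction, 4.4)
    print(f"decode share {fraction:.0%}: {low:.2f}x to {high:.2f}x end to end")
```

Under these assumptions, if decoding were half of the round trip, a 3.5x decoding speedup would translate to roughly 1.56x end to end. The full 3.5-4.4x would only appear if decoding accounted for essentially all of the latency.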
If the infill claim holds up outside the benchmark suite, the immediate winners are the teams building clinical drafting tools, who have been trying to shoehorn AR completions into a workflow that is really about editing existing text, not writing it from scratch.
Originally reported by paper
Original headline: Discrete Diffusion Beats Autoregressive for Radiology at 3.5–4.4x Speed With Bidirectional Fill-In