What would happen if we tried diffusion generation on LLMs? We get DiffusionGemma! 4x speedup! ⚡⚡⚡💎 blog.google/innovation-a...
- DiffusionGemma generates 256 tokens per forward pass using bidirectional attention, reaching 1,000+ tokens/sec on a single H100 GPU.
- With only 3.8B active parameters during inference and an 18GB VRAM footprint when quantized, it can run on consumer hardware rather than server-grade GPUs.
- Google recommends DiffusionGemma only for speed-critical workloads like in-line editing and code infilling, not for applications requiring maximum quality.
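
For a rough sense of what these figures imply, here is a back-of-the-envelope sketch in Python. It assumes the "4x" claim and the 1,000 tokens/sec figure refer to the same hardware and workload, which the post does not confirm. The function names are hypothetical and are not part of any DiffusionGemma API.

```python
# Back-of-the-envelope arithmetic from the figures quoted above.
# Assumption: the 4x speedup and the 1,000 tokens/sec throughput describe
# the same setup; the post does not state this explicitly.

def implied_baseline_tps(diffusion_tps: float, speedup: float) -> float:
    """Throughput an autoregressive baseline would need for the quoted speedup."""
    return diffusion_tps / speedup


def seconds_per_block(block_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to emit one block of tokens at a given throughput.

    This covers all denoising steps spent on the block, however many
    forward passes that takes.
    """
    return block_tokens / tokens_per_sec


if __name__ == "__main__":
    baseline = implied_baseline_tps(diffusion_tps=1000.0, speedup=4.0)
    block_time = seconds_per_block(block_tokens=256, tokens_per_sec=1000.0)
    print(f"implied baseline: {baseline:.0f} tokens/sec")
    print(f"time per 256-token block: {block_time:.3f} s")
```

Under those assumptions, the comparison point is an autoregressive model at roughly 250 tokens/sec, and each 256-token block has a budget of about a quarter of a second, however many forward passes it takes.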