Google’s DiffusionGemma generates 256 tokens in parallel and corrects itself as it goes



GenAI image generators like Stable Diffusion don’t draw images pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until they converge in a process known as diffusion. For years, the same principle could not be applied at scale to generate text.

Standard language models work like a typewriter: one token at a time, left-to-right, no way to revisit the delivered output. This example runs in the cloud, where batch sizes keep GPUs saturated. For local inference or low-parallel deployments, the GPU is often idle.

DiffusionGemma, which Google released this week, is an open-source experimental model that applies diffusion to generate text at production scale. built on Gemma 4 backbone and released under the Apache 2.0 license, it is the first natively supported diffusion language model on the open source vLLM extraction platform. It generates a block of 256 characters in parallel rather than sequentially, with each character position influencing each other. Google says that DiffusionGemma renders text on GPUs 4x faster than standard models. The FP8 version with a batch size of 1 on a single Nvidia H100 reaches 1008 tokens per second. On the H200, it reaches 1,288, nearly six times the standard autoregressive baseline, according to vLLM benchmark results published today.

Despite the speed boost, Google didn’t oversell the release. of the company start post DiffusionGemma directly acknowledged and added that the overall output quality of the Gemma is inferior to the standard Gemma 4 "For applications that require maximum quality, we recommend deploying the standard Gemma 4."

What DiffusionGemma does

DiffusionGemma does not generate verses in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and performs several refinement passes over the entire block at once. On each pass, he evaluates each position and locks in the ones he trusts the most. Uncertain positions are randomized and revisited in the next pass, with the model using what it solved in the previous round to inform the next attempt. The block is gradually joined until the position is stabilized enough to close the rest.

Two things follow from this architecture.

  • Self-correction. An autoregressive model that makes an error token is stuck with it because subsequent tokens are already conditioned on the error. DiffusionGemma can identify low-confidence positions and re-evaluate them in the next pass.

  • Bidirectional context. Each position refers to every other position in the block at the same time, including tokens that appear later in the sequence. This makes the model more suitable for constrained generation tasks where left-to-right generation fails structurally.

Google demonstrated both features with a fine-tuned Sudoku solver. The basic model solved zero puzzles. After fine-tuning on the Sudoku database, it achieved an 80% success rate and converged in 12 denoising steps instead of 48. The increase in efficiency came directly from the model’s ability to self-correct and stop early.

How is it built?

DiffusionGemma works as a 26B Expert Mixture model that only activates 3.8B parameters when inferring. It matches 18GB of VRAM on consumer hardware, including the Nvidia RTX 4090 and 5090. Google and NVIDIA are also optimized for corporate Hopper and Blackwell servers using NVFP4 cores.

The vLLM integration required new work because DiffusionGemma does not fit the standard service model. A typical vLLM stack applies the same type of attention to each query. DiffusionGemma queries cyclically alternate between causal and bidirectional attention via fast reads, canvas refinements, and block commits. The team built per-demand attention switching into both the Triton and FlashAttention 4 backends and reused the existing speculative decoding path for the refinement loop.

The new ModelState interface the team built for this integration is designed to support additional diffusion models in vLLM as they emerge.

Where speed gains and where it doesn’t

DiffusionGemma’s speed advantage is real, but conditional. Where it is applied depends entirely on the deployment context.

Numbers. Published benchmarks of vLLM at batch size 1 on a single H100 put the FP8 model at about five times the standard autoregressive baseline. In the H200, about six times. These peak numbers represent optimal conditions: single user, dedicated hardware, FP8 quantization.

Where it wins. Local output, single user applications and low parallel service. Under such conditions, the GPU has redundant computation and memory bandwidth is the bottleneck. DiffusionGemma’s parallel block generation fills this gap.

Where it doesn’t exist. High performance cloud service. When a server sends hundreds of simultaneous requests, autoregressive models already saturate available computations, and DiffusionGemma’s parallel decoding provides diminishing returns.

Quality ceiling. Guilherme O’Tina, AI researcher, Place a finer point on the X. "Local artifacts and hallucinations are different challenges and decide where this actually wins," O’Tina wrote.

How does it compare

Diffusion language models are not new. Researchers have been building them on a smaller scale for several years and Inception Labs’ Mercury Encoder Commercially applied this approach to coding tasks in 2025. What DiffusionGemma adds is scale – a 26B MoE backbone, a local vLLM service, and a model tuned by general-purpose instructions rather than domain-specific ones.

A more useful comparison for engineers evaluating this with available inference tools is speculative decoding, and the distinction is important. Speculative decoding keeps the standard autoregressive target model and uses a smaller draft model to predict several cues ahead. The target model checks them in one pass. If the sample is correct, the output distribution remains the same as the target. Architecture is immutable.

Andrew Kuncevichan ML and AI researcher focusing on production AI systems, put it directly to X. "DiffusionGemma is different. It doesn’t just predict future tokens. It creates a noisy 256-token canvas and repeatedly denoises the entire block in parallel. So it’s not just a decoding trick, it’s a different generational paradigm," Kuncevich wrote.

Compared to the standard Gemma 4, the trade-off is speed for quality. Google’s benchmark data shows DiffusionGemma below the standard Gemma 4 on overall output quality measures, with the gap varying by task.

In structured constrained tasks, including code completion, templating, and problems requiring bidirectional constraint propagation, architecture has the structural advantage that fine-tuning can emerge, as the Sudoku result demonstrates. In the open generation, the standard Gemma 4 remains the more powerful choice.

What this means for businesses

DiffusionGemma is served via a standard vLLM OpenAI-compliant endpoint, with no diffusion-specific pipeline modifications required.

This is not a general purpose model upgrade.

The choice of architecture for local or low parallel inference teams has just been expanded. Until now, reducing generation latency on dedicated GPU hardware meant using a smaller model and accepting a quality trade-off. DiffusionGemma offers a third path in the same parameter footprint on consumer hardware with same-day vLLM support.

For limited generation workloads, bidirectional focus is worth evaluating. Completing code, creating structured data, and tasks where the correct output depends on a context that has not yet been created are where this architecture has a structural advantage.

The ModelState interface built for this integration is designed to generalize additional diffusion models as they arise.

Quality variation is real and Google acknowledges it. For teams doing native inference on dedicated GPU hardware, this is worth a try.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *