I tried Google’s new DiffusionGemma, and watching it generate text like an image is unlike any local LLM


Most local LLMs are predictable by now. You download the model, point a runtime at it, ask a question, and then watch the text move across the screen one token at a time. The model may be better or worse than the one you used yesterday, but the basic experience is usually the same.

DiffusionGemma is different, at least when you run it in visual mode. Google’s new experimental Gemma model doesn’t just write its answer from left to right. Instead, it works on one block of text at a time, incrementally replacing and refining tokens until the answer settles into place. The effect is similar to watching an image generator denoise a picture, which is where the “diffusion” in the name comes from. It is a very different experience from your typical token-by-token LLM.

I tested it on an M4 Pro MacBook Pro using a 4-bit GGUF via the custom llama.cpp build detailed by Unsloth. It didn’t seem any faster to me than running Google’s regular Gemma 4 26B-A4B, and it hammered my Mac in a way local LLMs usually don’t, causing system-wide slowdowns. It’s a strange experience, but also a uniquely thrilling one, given how differently it works compared to your typical autoregressive language model.

DiffusionGemma changes what text generation looks like

Visual mode shows you exactly what’s going on

[Media: diffusiongemma-bird-pipes]

DiffusionGemma feels weird because the output doesn’t arrive like normal text. With visual mode enabled, you can watch the 256-token canvas rewrite itself as the model runs: placeholder-looking text appears first, parts of it change, and the response gradually becomes more coherent. It’s not just a stream of words, each appended to the one before it, and that alone makes it feel like a distinct category of local model.

It sounds gimmicky, and in a way it is. A model does not need to show its generation process to be useful, and most local LLM interfaces are cleaner precisely because they hide the messy parts. But in this case, the visualization does a good job of explaining what makes DiffusionGemma different. You can read as much as you like about text diffusion, but seeing the text change in place over and over again makes the concept much easier to understand.

A normal autoregressive model has to predict the next token, then the one after that, and then the one after that. It can plan loosely, and good models clearly do, but the token it writes now cannot be directly shaped by the exact token it will write 50 tokens later, because that token doesn’t exist yet. DiffusionGemma instead works on a block, with bidirectional attention within the canvas. It can use the later parts of the block to improve the earlier parts, so the output appears to come into focus rather than being typed.

This is a conceptual benefit of diffusion-based language models before you even get to speed. The 256-token canvas gives the model a temporary sketch area where the beginning and end of the block can interact before the block is committed. That’s why diffusion is so interesting for inline editing, code completion, structured text, and other situations where the best answer isn’t always easiest to produce from left to right.
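To make the contrast concrete, here is a toy sketch in Python. It is not DiffusionGemma’s actual algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the answer and attaches a random confidence to each guess, and the rule of committing the most confident guesses each pass is a simplification often used to explain masked text diffusion. Unlike the real model, it never revises a token once it is committed. The only point is to show a block being filled in out of order over a few passes, versus one token per pass.

```python
import random

MASK = "_"  # placeholder shown for positions that are not committed yet


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for one model forward pass.

    For every still-masked position it returns (proposed_token, confidence).
    A real model would predict tokens; this toy looks them up in `target`
    and draws a random confidence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def diffuse_block(target, tokens_per_step, seed=0):
    """Fill a block in parallel, committing the most confident proposals
    on each pass. Returns every intermediate canvas state."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = [list(canvas)]
    while MASK in canvas:
        proposals = toy_denoiser(canvas, target, rng)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return history


def autoregressive(target):
    """Left-to-right baseline: exactly one new token per step."""
    return [target[: n + 1] for n in range(len(target))]


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    for state in diffuse_block(words, tokens_per_step=3):
        print(" ".join(state))
    print("diffusion passes:", len(diffuse_block(words, 3)) - 1)
    print("autoregressive steps:", len(autoregressive(words)))
```

With nine words and three commits per pass, the toy block finishes in three passes, where the left-to-right loop needs nine steps.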

That’s why DiffusionGemma feels so different from the local models people are used to. We’re all used to seeing Qwen, Gemma, Llama, or whatever else stream text in a way that makes the model feel like it is writing. In visual mode, DiffusionGemma feels more like watching a draft being edited in front of you, except you can see every weird intermediate state on the way there.

Google’s speed claims require context

Especially if you’re running it on a Mac

[Image: how DiffusionGemma works. Credit: Google]

Google’s pitch for DiffusionGemma is speed. In its launch post, Google says the model can provide up to 4x faster text generation on dedicated GPUs, citing more than 1,000 tokens per second on such hardware and more than 700 tokens per second on an RTX 5090. It also says that the quantized model can fit within 18GB of VRAM on high-end consumer GPUs.

My M4 Pro run was nothing like that. I didn’t get a normal tokens-per-second reading, but the footer I captured showed a total of 137.9 seconds, 123 denoising steps, and 9 blocks, which works out to 1.121 seconds per step. Since each block is a canvas of 256 tokens, this corresponds to 2,304 canvas positions in 123 steps, or about 18.7 token positions per denoising step.
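The arithmetic behind those figures can be reproduced in a few lines of Python. The inputs are the numbers from the run footer; the last value, canvas positions per wall-clock second, is an extra derived figure and should not be read as a tokens-per-second measurement, since I am only assuming every canvas position counts equally.

```python
# Figures taken from the run footer described above.
total_seconds = 137.9
denoising_steps = 123
blocks = 9
canvas_tokens = 256  # canvas size per block

canvas_positions = blocks * canvas_tokens
seconds_per_step = total_seconds / denoising_steps
positions_per_step = canvas_positions / denoising_steps
# Derived for illustration only; not a real tokens-per-second reading.
positions_per_second = canvas_positions / total_seconds

print(f"canvas positions:     {canvas_positions}")
print(f"seconds per step:     {seconds_per_step:.3f}")
print(f"positions per step:   {positions_per_step:.1f}")
print(f"positions per second: {positions_per_second:.1f}")
```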

This is the part I suspect people will miss. Google talks about parallel denoising and generating 15-20 tokens per forward pass, so a headline number like 700 tokens per second shouldn’t be read as 700 left-to-right autoregressive tokens appearing cleanly on screen. I don’t think Google is simply counting discarded guesses as finished output, but DiffusionGemma arrives at its output differently: it refines multiple token positions within the canvas before committing the block. The speed can be real, but the experience is not the same as watching a normal model stream 700 final tokens per second.
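A back-of-the-envelope calculation shows why the headline number reads differently. The sketch below assumes, purely for illustration, that throughput is just tokens committed per forward pass multiplied by forward passes per second; how Google actually measured its figures is not covered here, and `passes_per_second` is a hypothetical helper.

```python
def passes_per_second(tokens_per_second, tokens_per_pass):
    """Forward passes per second needed to hit a given throughput,
    assuming throughput = tokens_per_pass * passes_per_second."""
    return tokens_per_second / tokens_per_pass


if __name__ == "__main__":
    headline = 700  # tokens per second quoted for the RTX 5090
    # 1 token per pass is the autoregressive case; 15 and 20 are the
    # ends of the quoted range for parallel denoising.
    for tokens_per_pass in (1, 15, 20):
        rate = passes_per_second(headline, tokens_per_pass)
        print(f"{tokens_per_pass:>2} tokens/pass -> {rate:.1f} passes/s")
```

Under that assumption, a one-token-per-pass model would need 700 forward passes per second to reach the headline figure, while 15-20 tokens per pass needs only about 35 to 47.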

Hardware matters too. My Mac slowed down system-wide while the model was running, and it didn’t feel any faster than running Google’s regular Gemma 4 26B-A4B locally. Google cautions that Apple Silicon Macs may not see the same speedup, because unified memory systems are often limited by memory bandwidth during inference, while DiffusionGemma’s speedup depends on handing a larger, compute-heavy workload to a dedicated accelerator.

This doesn’t make the speed claim wrong; it just means raw performance was never going to be the interesting part of my run. What was interesting was seeing the model use a markedly different generation process, and seeing how much that changed the feel of interacting with a local LLM.

It’s still early, and a bit awkward to run locally

This is not yet standard llama.cpp support

The GGUF route I used to get this working depends on the DiffusionGemma branch from an open llama.cpp pull request. Unsloth’s instructions build a custom llama-diffusion-cli runner, because the default llama-cli and llama-server cannot yet run this model’s generation path. You can see what the output looks like in the video above.

This distinction matters if you are used to Ollama or llama.cpp being the easy local LLM defaults. This isn’t the type of model you casually drop into your existing setup and treat as just another GGUF. If you want the part that makes it visually interesting, you need the right branch, the right runner, and the --diffusion-visual flag. Once compiled, the command to run it with visual output is:

./llama-diffusion-cli -m ./diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 4096 --diffusion-visual

The quantized files are at least realistic for consumer hardware. Unsloth lists the 16GB Q4_K_M file as the smallest option, with larger options at 18GB, 21GB, 25GB, and 47GB. That puts the model in the same general territory as other large local models that you can run on a GPU with a decent amount of VRAM.

This is still an experimental build, though. The model support, the runner, and the visual output are the point right now, not rough edges around an otherwise boring daily-driver model. If you’ve been hearing about diffusion models for a while and want to check one out for yourself, this is the place to start.

DiffusionGemma is not a straight upgrade over Gemma 4

Google says that quality is a trade-off

The name makes DiffusionGemma sound like just another member of the Gemma family, but the model has a very different purpose. Google describes it as an experimental open model based on the Gemma 4 26B-A4B Mixture of Experts architecture, with about 26B total parameters and about 4B active parameters. The unusual part is the diffusion head and block-based generation, not the basic idea of a local MoE model.

Google is very clear that its standard autoregressive Gemma 4 models remain the recommendation for maximum output quality. DiffusionGemma prioritizes speed and parallel generation, and the published benchmark table generally shows it trailing the standard Gemma 4 26B-A4B on reasoning, coding, vision, and long-context tests.

The particular test I ran was at least functional. I asked it to create a Flappy Bird-style game in Python, rendered in the browser and served with Flask, and the resulting project worked when I tested it. The gravity was too strong, so it wasn’t very pleasant to play, but it produced the Flask app, HTML, CSS, and JavaScript needed to get a browser game running on screen. In the video above you can see it working after I copied and pasted the code from the output into the designated files. You can also read the full transcript at this Gist link.

It didn’t really matter what I asked the model to generate. Flappy Bird was just one of countless prompts I could have used, and the takeaway would have been the same: the generation process produced something normal, even though it looked like nothing else along the way. DiffusionGemma isn’t interesting because it suddenly makes local coding better; it’s interesting because it exposes a different way of creating text, where output is drafted, refined, and committed in blocks rather than streamed one token at a time. My experience here lines up with Google’s examples of inline editing, non-linear text structures, code completion, and other workflows where left-to-right generation isn’t always the most natural fit. I wouldn’t claim from my brief testing that DiffusionGemma is ready to take over these workflows, but having watched it happen, the concept makes sense.

DiffusionGemma is experimental, early, and not something I would compare to a regular local LLM. Watching the answer shift is weird, a bit distracting, and genuinely useful for understanding what Google is trying to do, and it makes it easier than ever for anyone to see what a diffusion language model looks like in practice.


