DiffusionGemma is an open-weights discrete-diffusion language model built on the Gemma 4 26B-A4B MoE base (25.2B total / 3.8B active parameters; 128 experts + 1 shared, top-8 routing; 256K context; ~550M-parameter vision encoder for image/video input). Instead of generating tokens left to right, it iteratively denoises blocks ("canvases") of tokens in parallel: an autoregressive encoder caches the prompt context while a bidirectional decoder refines the generation canvas, using an Entropy-Bounded Denoising + Adaptive Stopping sampler capped at 48 steps. Google reports 15–20 tokens per forward pass and more than 1,100 tok/s per user at low batch size on H100 with FP8.
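The sampler is named above but not specified, so the following is only a minimal sketch of one plausible reading of entropy-bounded denoising with adaptive stopping. Each step commits the lowest-entropy masked slots until a cumulative entropy budget is spent, and the loop ends as soon as the canvas is full. `toy_denoiser`, `eb_denoise`, `MASK`, and `entropy_budget` are hypothetical names; the toy denoiser is a random stand-in, not the model's decoder, and only the 48-step cap comes from the text.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a canvas slot that is not decoded yet


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_denoiser(canvas, vocab_size, rng):
    """Random stand-in for the bidirectional decoder (an assumption).

    Returns one distribution per slot. Distributions get sharper as more
    slots are filled, mimicking a denoiser that grows more confident.
    """
    filled = sum(tok != MASK for tok in canvas)
    sharpness = 1.0 + 2.0 * filled
    dists = []
    for _ in canvas:
        logits = [rng.random() * sharpness for _ in range(vocab_size)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def eb_denoise(canvas_len=16, vocab_size=8, entropy_budget=1.0, max_steps=48, seed=0):
    """Fill a fully masked canvas; return (canvas, number of steps used)."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    steps = 0
    # Adaptive stopping: exit once nothing is masked, or at the step cap.
    while MASK in canvas and steps < max_steps:
        steps += 1
        dists = toy_denoiser(canvas, vocab_size, rng)
        ent = {i: entropy(dists[i]) for i, tok in enumerate(canvas) if tok == MASK}
        spent = 0.0
        # Visit masked slots from most to least confident.
        for rank, i in enumerate(sorted(ent, key=ent.get)):
            # Always commit one slot per step; stop when the budget is exceeded.
            if rank > 0 and spent + ent[i] > entropy_budget:
                break
            spent += ent[i]
            canvas[i] = max(range(vocab_size), key=dists[i].__getitem__)
    return canvas, steps


if __name__ == "__main__":
    canvas, steps = eb_denoise()
    print(f"filled {len(canvas)} slots in {steps} steps")
```

In this sketch a budget of zero degenerates to one token per step, while a very large budget fills the whole canvas in a single pass, which is the speed/quality dial the sampler name suggests.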

Released under Apache-2.0. The diffusion variant trades some quality for speed relative to its autoregressive Gemma 4 sibling (e.g. AIME 2026 69.1 vs 88.3, GPQA Diamond 73.2 vs 82.3). Self-reported scores: MMLU-Pro 77.6, AIME 2026 69.1, LiveCodeBench v6 69.1, GPQA Diamond 73.2, τ²-Bench 56.2, MMMU-Pro 54.3. It is not yet listed on the AA Intelligence Index.
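To put the trade-off into numbers, this short sketch computes the score gaps quoted above and the forward-pass rate implied by the reported throughput. It assumes the 15–20 tokens per pass and the 1,100 tok/s figure describe the same low-batch setting; the variable and function names are illustrative only.

```python
# Scores copied from the text: (diffusion variant, autoregressive sibling).
scores = {
    "AIME 2026": (69.1, 88.3),
    "GPQA Diamond": (73.2, 82.3),
}


def quality_gap(diffusion, autoregressive):
    """Return (gap in points, fraction of the sibling's score retained)."""
    gap = round(autoregressive - diffusion, 1)
    retained = round(diffusion / autoregressive, 3)
    return gap, retained


gaps = {name: quality_gap(d, ar) for name, (d, ar) in scores.items()}

# Reported lower bound on per-user throughput at low batch on H100/FP8.
TOKENS_PER_SECOND = 1100
# Assumption: 15-20 tokens per forward pass applies to that same setting,
# so more tokens per pass means fewer passes are needed per second.
passes_per_second = (TOKENS_PER_SECOND / 20, TOKENS_PER_SECOND / 15)

for name, (gap, retained) in gaps.items():
    print(f"{name}: {gap} points lower, {retained:.1%} of the autoregressive score")
print(
    "implied forward passes per second: "
    f"{passes_per_second[0]:.0f} to {passes_per_second[1]:.0f}"
)
```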

Model Details

Architecture     MoE
Parameters       25.2B (total)
Active params    3.8B
Experts          128 + 1 shared (top-8)
Context window   262,144 tokens
License          Apache 2.0
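As a rough illustration of how sparse the model is, the sketch below derives the active-parameter fraction and the number of experts used per token from the figures listed above. It assumes the shared expert runs for every token in addition to the top-8 routed experts, which the page does not state explicitly; the constant names are illustrative.

```python
TOTAL_PARAMS_B = 25.2   # total parameters, in billions
ACTIVE_PARAMS_B = 3.8   # parameters active per token, in billions
ROUTED_EXPERTS = 128
SHARED_EXPERTS = 1
TOP_K = 8

# Share of all weights that take part in processing a single token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Share of routed experts selected per token.
routed_fraction = TOP_K / ROUTED_EXPERTS

# Assumption: the shared expert is always on, alongside the top-k routed ones.
experts_per_token = TOP_K + SHARED_EXPERTS

print(f"active parameter fraction: {active_fraction:.1%}")
print(f"routed experts selected per token: {routed_fraction:.2%}")
print(f"experts evaluated per token (assumed): {experts_per_token}")
```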

Benchmark Scores

Benchmark          Score
MMLU-Pro           77.6
AIME 2026          69.1
LiveCodeBench v6   69.1
GPQA Diamond       73.2
τ²-Bench           56.2
MMMU-Pro           54.3
Tags: open-weight, moe, multimodal, architecture, research
