Gemma 4 12B
Encoder-free unified multimodal Gemma: a ~12B dense Transformer (11.95B per the model card) that sits between the E4B edge variant and the 26B-A4B MoE in the Gemma 4 family. It targets 12–16 GB of VRAM and is positioned as the first mid-sized open-weight model with native audio input. Licensed under Apache 2.0.
The "unified" designation refers to its encoder-free architecture: instead of a separate vision tower or audio encoder, raw inputs project directly into the LLM's embedding space via lightweight linear maps. Per Maarten Grootendorst's visual guide, that means:
- Vision: a ~35M-param embedder (vs ~550M in larger Gemma 4 variants) ingests 48×48 pixel patches (6,912 raw pixel values per patch) through a single projection layer, with learnable 1120×3840 X/Y coordinate matrices for spatial position.
- Audio: input is split into 40 ms windows at 16 kHz (640 raw amplitude samples each) and linearly projected into token space — no encoder, no positional embeddings, processed "much like text sequences."
- Text-side architecture: 48 layers, 3,840 hidden dimension, local sliding-window attention layers interleaved with global attention layers (the final layer is always global), 1,024-token sliding window, 262K-token vocabulary, 256K-token context. Multi-Token Prediction (MTP) drafters reduce latency. Training data cut-off: January 2025.
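The figures in the list above can be cross-checked with a little arithmetic. The sketch below is not taken from the model's code; it assumes RGB input (3 channels), bias-free linear projections, exactly two coordinate matrices (one for X, one for Y), and non-overlapping audio windows. All function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the embedder dimensions quoted above.
# Assumptions (not confirmed by the source): RGB input, bias-free linear
# maps, one X and one Y coordinate matrix, non-overlapping audio windows.

PATCH_SIDE = 48           # pixels per patch side
CHANNELS = 3              # assumption: RGB
HIDDEN = 3840             # text-side hidden dimension
COORD_ROWS = 1120         # rows in each learnable coordinate matrix
SAMPLE_RATE_HZ = 16_000   # audio sample rate
WINDOW_MS = 40            # audio window length


def patch_values(side: int = PATCH_SIDE, channels: int = CHANNELS) -> int:
    """Raw pixel values in one square patch."""
    return side * side * channels


def window_samples(rate_hz: int = SAMPLE_RATE_HZ, window_ms: int = WINDOW_MS) -> int:
    """Raw amplitude samples in one audio window."""
    return rate_hz * window_ms // 1000


def linear_params(d_in: int, d_out: int) -> int:
    """Weight count of a bias-free linear projection."""
    return d_in * d_out


vision_projection = linear_params(patch_values(), HIDDEN)
coord_tables = 2 * COORD_ROWS * HIDDEN        # X matrix + Y matrix
vision_total = vision_projection + coord_tables
audio_projection = linear_params(window_samples(), HIDDEN)
audio_tokens_per_second = 1000 // WINDOW_MS   # one token per window

print(f"values per patch:        {patch_values():,}")
print(f"samples per window:      {window_samples():,}")
print(f"vision embedder params:  {vision_total:,}")
print(f"audio projection params: {audio_projection:,}")
print(f"audio tokens per second: {audio_tokens_per_second}")
```

Under these assumptions the vision side comes to roughly 35.1M weights, in line with the ~35M figure quoted for the embedder, and one second of audio maps to 25 tokens.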
Reported on the Hugging Face model card: MMLU Pro 77.2%, AIME 2026 (no tools) 77.5%, LiveCodeBench v6 72.0%, GPQA Diamond 78.8%, MMMU-Pro Vision 69.1%, CoVoST (audio) 38.5%. Google positions it as approaching the larger 26B-A4B MoE on standard text benchmarks at less than half the memory footprint.
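To put the memory claim in perspective, here is a rough weights-only estimate at a few common precisions. It ignores KV cache, activations, and runtime overhead, so real usage will be higher. The 26B total is an assumption read off the "26B-A4B" name, not a figure from the model card, and `weight_gb` is a hypothetical helper.

```python
# Weights-only memory estimate; KV cache and activations are NOT included.

PARAMS_12B = 11.95e9   # parameter count quoted from the model card
PARAMS_26B = 26e9      # assumption: nominal total implied by "26B-A4B"


def weight_gb(params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gb(PARAMS_12B, bits):6.2f} GB")

# At equal precision the ratio depends only on parameter counts, assuming
# the MoE keeps all of its experts resident in memory.
ratio = PARAMS_12B / PARAMS_26B
print(f"12B / 26B weight ratio: {ratio:.0%}")
```

By this estimate, 16-bit weights alone (about 24 GB) would exceed the 12–16 GB VRAM target, which suggests that target presumes some form of quantization. The parameter ratio of roughly 46% is consistent with the "less than half" positioning.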
Distribution: Hugging Face (Instruct + Base), Kaggle, LM Studio, Ollama, Google AI Edge Gallery App; toolchain support for Transformers, llama.cpp, MLX, SGLang, vLLM, Unsloth. Not yet scored on Artificial Analysis.
Model Details
Benchmark Scores
| Benchmark | Score | Mode |
|---|---|---|
| MMLU Pro | 77.2% | — |
| AIME 2026 (no tools) | 77.5% | — |
| LiveCodeBench v6 | 72.0% | — |
| GPQA Diamond | 78.8% | — |
| MMMU-Pro Vision | 69.1% | — |
| CoVoST (Audio) | 38.5% | — |