Gemma 4 12B

Encoder-free, unified multimodal Gemma: a ~12B dense Transformer (11.95B per the model card) that sits between the E4B edge variant and the 26B-A4B MoE in the Gemma 4 family. It targets 12–16 GB of VRAM and is the first mid-sized open-weight model with native audio input. Licensed under Apache 2.0.

The "unified" designation refers to its encoder-free architecture: instead of a separate vision tower or audio encoder, raw inputs project directly into the LLM's embedding space via lightweight linear maps. Per Maarten Grootendorst's visual guide, that means:

Vision: a ~35M-param embedder (vs ~550M in larger Gemma 4 variants) ingests 48×48 pixel patches (6,912 raw pixel values per patch) through a single projection layer, with learnable 1120×3840 X/Y coordinate matrices for spatial position.
Audio: input is split into 40 ms windows at 16 kHz (640 raw amplitude samples each) and linearly projected into token space — no encoder, no positional embeddings, processed "much like text sequences."
Text-side architecture: 48 layers with a hidden dimension of 3,840; local sliding-window attention (1,024-token window) interleaved with global attention (global always last); 262K vocabulary; 256K context (262,144 tokens). Multi-Token Prediction (MTP) drafters reduce latency. Training data cutoff: January 2025.
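The input sizes quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes RGB input (three channels), a bias-free patch projection, exactly two coordinate matrices (X and Y), and non-overlapping audio windows, none of which the source states outright.

```python
# Sketch only: sizes are taken from the text above; all names are hypothetical.
PATCH_SIDE = 48          # pixels per patch edge
CHANNELS = 3             # assumption: RGB input (48 * 48 * 3 = 6,912)
HIDDEN_DIM = 3_840       # text-side hidden dimension
COORD_ROWS = 1_120       # rows in each learnable coordinate matrix
SAMPLE_RATE_HZ = 16_000  # audio sample rate
WINDOW_MS = 40           # audio window length


def values_per_patch(side: int = PATCH_SIDE, channels: int = CHANNELS) -> int:
    """Raw pixel values fed to the projection for one patch."""
    return side * side * channels


def samples_per_window(rate_hz: int = SAMPLE_RATE_HZ, window_ms: int = WINDOW_MS) -> int:
    """Raw amplitude samples in one audio window."""
    return rate_hz * window_ms // 1000


def vision_embedder_params() -> int:
    """Weights in one patch projection plus the X and Y coordinate matrices.

    Assumption: no bias terms and exactly two coordinate matrices.
    """
    projection = values_per_patch() * HIDDEN_DIM
    coordinates = 2 * COORD_ROWS * HIDDEN_DIM
    return projection + coordinates


def split_audio(samples: list[float]) -> list[list[float]]:
    """Cut a waveform into non-overlapping windows, dropping any partial tail.

    Assumption: windows do not overlap; the source does not say either way.
    """
    size = samples_per_window()
    whole = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(whole)]


if __name__ == "__main__":
    print(values_per_patch())        # pixel values per patch
    print(samples_per_window())      # samples per audio window
    print(vision_embedder_params())  # rough vision embedder size
    print(len(split_audio([0.0] * SAMPLE_RATE_HZ)))  # windows in one second
```

Under these assumptions the embedder count comes to 35,143,680, in line with the ~35M figure quoted above, and one second of 16 kHz audio yields 25 windows.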

Scores reported on the HuggingFace model card: MMLU Pro 77.2%, AIME 2026 (no tools) 77.5%, LiveCodeBench v6 72.0%, GPQA Diamond 78.8%, MMMU-Pro Vision 69.1%, CoVoST (audio) 38.5%. Google positions it as approaching the larger 26B-A4B MoE on standard text benchmarks at less than 50% of the memory footprint.
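For a rough sense of how the 11.95B parameter count relates to the 12–16 GB VRAM target, the sketch below estimates weight storage at a few common precisions. It is a back-of-the-envelope calculation, not a measurement: the precision list and the helper name `weight_gb` are assumptions, it counts weights only (no KV cache, activations, or runtime overhead), and the source does not say which precision the VRAM target assumes.

```python
# Sketch only: weight-only estimate; precisions and names are assumptions.
PARAMS = 11.95e9  # parameter count from the model card

# Bytes per parameter for some common storage formats.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_gb(PARAMS, size):.1f} GB")
```

By this estimate, 16-bit weights alone (about 23.9 GB) would exceed the stated range, so the target presumably relies on some form of quantization; actual usage also depends on context length and the serving runtime.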

Distribution: HF (Instruct + Base), Kaggle, LM Studio, Ollama, Google AI Edge Gallery App; toolchain support for Transformers, llama.cpp, MLX, SGLang, vLLM, Unsloth. Not yet scored on Artificial Analysis.

Links: Announcement | HuggingFace (Instruct) | HuggingFace (Base) | A Visual Guide to Gemma 4 12B (Maarten Grootendorst) | Documentation | Artificial Analysis

Model Details

Architecture	Dense
Parameters	11.95B
Context window	262,144 tokens
AA Intelligence	22
License	Apache 2.0
Base model	gemma-4

Benchmark Scores

Benchmark	Score	Mode
MMLU Pro	77.2%	—
AIME 2026 (no tools)	77.5%	—
LiveCodeBench v6	72.0%	—
GPQA Diamond	78.8%	—
MMMU Pro Vision	69.1%	—
CoVoST (Audio)	38.5%	—

Tags: frontier, open-weight, multimodal, on-device
