Encoder-free unified multimodal Gemma: a ~12B dense Transformer (11.95B parameters per the model card) that sits between the E4B edge variant and the 26B-A4B MoE in the Gemma 4 family. It targets 12–16 GB of VRAM and is the first mid-sized open-weight model with native audio input. Licensed under Apache 2.0.
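
As a rough check on the 12–16 GB VRAM target, the weight footprint can be estimated from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a figure from the model card: the precisions listed are illustrative assumptions, and the KV cache, activations, and runtime overhead are ignored.

```python
# Weight-only memory estimate for an 11.95B-parameter dense model.
# Assumption: every parameter is stored at the same precision; KV cache,
# activations, and framework overhead are NOT included.

PARAMS = 11.95e9  # parameter count quoted from the model card


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in GB (10**9 bytes) at the given precision."""
    return params * bits_per_weight / 8 / 1e9


# Illustrative precisions (assumed, not taken from the source).
PRECISIONS = [("bf16", 16), ("int8", 8), ("int4", 4)]

estimates = {name: weight_gb(PARAMS, bits) for name, bits in PRECISIONS}

for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB of weights")
```

Under these assumptions, bf16 weights alone come to about 23.9 GB, which is above the stated target, while 8-bit weights come to about 12 GB. That suggests the 12–16 GB figure presumes some form of quantization, though the source does not say which.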

The "unified" designation refers to its encoder-free architecture: instead of a separate vision tower or audio encoder, raw inputs project directly into the LLM's embedding space via lightweight linear maps. Per Maarten Grootendorst's visual guide, that means:

  • Vision: a ~35M-param embedder (vs ~550M in larger Gemma 4 variants) ingests 48×48 pixel patches (6,912 raw pixel values per patch) through a single projection layer, with learnable 1120×3840 X/Y coordinate matrices for spatial position.
  • Audio: input is split into 40 ms windows at 16 kHz (640 raw amplitude samples each) and linearly projected into token space — no encoder, no positional embeddings, processed "much like text sequences."
  • Text-side architecture: 48 layers, a hidden dimension of 3,840, local sliding-window attention (1,024-token window) interleaved with global attention (the last layer is always global), a 262K-token vocabulary, and a 256K-token context window. Multi-Token Prediction (MTP) drafters reduce latency. Training data cutoff: January 2025.
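
The sizes quoted in the bullets above can be cross-checked with simple arithmetic. The sketch below is a consistency check rather than the model's actual implementation. It assumes RGB input (3 channels), bias-free linear projections, exactly two coordinate matrices (one for X, one for Y), and non-overlapping audio windows. None of these details is confirmed by the source.

```python
# Consistency check for the vision and audio input sizes quoted above.
# Assumptions (not confirmed by the source): RGB input, bias-free linear
# projections, two coordinate matrices (X and Y), non-overlapping windows.

HIDDEN = 3840         # hidden dimension
PATCH = 48            # patch side length in pixels
CHANNELS = 3          # assumed RGB
COORD_ROWS = 1120     # rows in each learnable coordinate matrix
SAMPLE_RATE = 16_000  # audio sample rate in Hz
WINDOW_MS = 40        # audio window length in milliseconds

# Vision: each patch is flattened to raw pixel values, then projected.
patch_values = PATCH * PATCH * CHANNELS        # raw values per patch
vision_proj_params = patch_values * HIDDEN     # single projection layer
coord_params = 2 * COORD_ROWS * HIDDEN         # X and Y coordinate tables
vision_total = vision_proj_params + coord_params

# Audio: each window of raw samples is projected to one token.
samples_per_window = SAMPLE_RATE * WINDOW_MS // 1000
audio_proj_params = samples_per_window * HIDDEN


def audio_tokens(seconds: float) -> int:
    """Whole windows (tokens) in a clip, dropping any trailing partial window."""
    return int(seconds * SAMPLE_RATE) // samples_per_window


print(f"values per patch:       {patch_values}")
print(f"vision embedder params: {vision_total / 1e6:.1f}M")
print(f"samples per window:     {samples_per_window}")
print(f"audio projection:       {audio_proj_params / 1e6:.2f}M params")
print(f"tokens for 2 s clip:    {audio_tokens(2.0)}")
```

Under these assumptions the patch size reproduces the 6,912 values per patch, the window size reproduces the 640 samples, and the projection plus the two coordinate matrices total roughly 35.1M parameters, in line with the ~35M figure quoted for the vision embedder.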

Scores reported on the Hugging Face model card: MMLU Pro 77.2%, AIME 2026 (no tools) 77.5%, LiveCodeBench v6 72.0%, GPQA Diamond 78.8%, MMMU-Pro Vision 69.1%, CoVoST (audio) 38.5%. Google positions it as approaching the larger 26B-A4B MoE on standard text benchmarks at less than half the memory footprint.

Distribution: Hugging Face (Instruct + Base), Kaggle, LM Studio, Ollama, and the Google AI Edge Gallery app. Toolchain support covers Transformers, llama.cpp, MLX, SGLang, vLLM, and Unsloth. Not yet scored on Artificial Analysis.

Model Details

Architecture      Dense
Parameters        11.95B
Context window    262,144 tokens
License           Apache 2.0
Base model        gemma-4

Benchmark Scores

Benchmark               Score
MMLU Pro                77.2%
AIME 2026 (no tools)    77.5%
LiveCodeBench v6        72.0%
GPQA Diamond            78.8%
MMMU-Pro Vision         69.1%
CoVoST (Audio)          38.5%

Tags: frontier, open-weight, multimodal, on-device
