A unified discrete diffusion language model that natively combines multimodal understanding and generation in a single architecture: a 16B MoE with only ~1B parameters activated per token. Three core components: a SigLIP-VQ discrete semantic tokenizer that maps visual inputs to discrete tokens, a dLLM-MoE backbone using block-level masked diffusion with Mask Token Prediction, and a diffusion decoder with 8-step distilled inference for high-fidelity 1024×1024 generation.
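
The block-level masked diffusion loop can be pictured roughly as follows. This is a minimal sketch, not the released implementation: all names (`toy_model`, `BLOCK_LEN`, `NUM_STEPS`, `MASK_ID`) and the per-step commitment schedule are illustrative assumptions. Each block starts fully masked, and each diffusion step commits the most confident Mask Token Predictions until the block is complete; blocks are decoded left to right.

```python
import torch

VOCAB = 1000
MASK_ID = VOCAB                       # reserve an id outside the vocab for [MASK] (assumed)
BLOCK_LEN, NUM_BLOCKS, NUM_STEPS = 8, 4, 4

def toy_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the dLLM-MoE backbone: per-position logits over the vocab."""
    return torch.randn(len(tokens), VOCAB)

def decode_block(prefix: torch.Tensor) -> torch.Tensor:
    """Decode one block: start fully masked, then over a few diffusion steps
    commit the most confident Mask Token Predictions."""
    block = torch.full((BLOCK_LEN,), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        masked = block == MASK_ID
        if not masked.any():
            break
        logits = toy_model(torch.cat([prefix, block]))[-BLOCK_LEN:]
        conf, pred = logits.softmax(-1).max(-1)
        conf[~masked] = -1.0                          # only still-masked slots compete
        k = max(1, int(masked.sum()) // (NUM_STEPS - step))
        commit = conf.topk(k).indices                 # unmask the k most confident positions
        block[commit] = pred[commit]
    return block

seq = torch.empty(0, dtype=torch.long)
for _ in range(NUM_BLOCKS):                           # blocks are decoded left to right
    seq = torch.cat([seq, decode_block(seq)])
print(seq.tolist())
```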

Capabilities span text-to-image generation, VQA, image editing, document understanding, and interleaved generation with reasoning. Includes SPRINT acceleration (KV cache reuse + adaptive unmasking) for efficient inference. Matches specialized VLMs on multimodal understanding while delivering strong image generation and editing. Apache 2.0 licensed. Developed by Inclusion AI (Ant Group AGI Research Center).
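
A rough sketch of the two SPRINT ideas follows; `ToyBackbone`, `THRESH`, and `adaptive_unmask` are illustrative names, and the cache is a loose analogue of KV cache reuse rather than the actual attention-level implementation. Committed positions keep a cached state that later diffusion steps reuse instead of recomputing, and the number of tokens unmasked per step adapts to the confidence distribution rather than following a fixed schedule.

```python
import torch

VOCAB = 1000
MASK_ID = VOCAB
THRESH = 0.9                                          # assumed confidence threshold

class ToyBackbone:
    """Stand-in backbone that caches per-position states for committed tokens,
    a loose analogue of reusing the KV cache across diffusion steps."""
    def __init__(self):
        self.cache = {}                               # position -> cached state

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        logits = torch.randn(len(tokens), VOCAB)
        for pos, tok in enumerate(tokens.tolist()):
            if tok != MASK_ID:
                if pos in self.cache:
                    logits[pos] = self.cache[pos]     # committed earlier: reuse cached state
                else:
                    self.cache[pos] = logits[pos].clone()  # just committed: cache for later steps
        return logits

def adaptive_unmask(block: torch.Tensor, model: ToyBackbone, max_steps: int = 16) -> torch.Tensor:
    """Commit every prediction above THRESH each step (at least one), so easy
    regions finish in few steps instead of following a fixed schedule."""
    for _ in range(max_steps):
        masked = block == MASK_ID
        if not masked.any():
            break
        conf, pred = model.forward(block).softmax(-1).max(-1)
        commit = masked & (conf > THRESH)             # adaptive: threshold, not a fixed k
        if not commit.any():                          # guarantee progress each step
            commit = torch.zeros_like(masked)
            commit[torch.where(masked)[0][conf[masked].argmax()]] = True
        block[commit] = pred[commit]
    return block

block = torch.full((8,), MASK_ID, dtype=torch.long)
print(adaptive_unmask(block, ToyBackbone()).tolist())
```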

Model Details

Architecture: MoE
Parameters: 16B

Tags: multimodal, generation, vision, open-weight, MoE
