LLaDA 2.0-Uni
Unified discrete diffusion language model that natively combines multimodal understanding and generation in a single architecture. 16B-parameter MoE with only ~1B parameters activated per token. Three core components: a SigLIP-VQ discrete semantic tokenizer (visual inputs → discrete tokens), a dLLM-MoE backbone using block-level masked diffusion with Mask Token Prediction, and a diffusion decoder with 8-step distilled inference for high-fidelity 1024×1024 generation.
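As a rough illustration of the block-level masked diffusion decoding described above, the sketch below fills one block of mask tokens by repeated Mask Token Prediction, re-predicting all masked positions each step and committing the most confident ones. The mask-token id, block length, step count, and the `model(...).logits` interface are illustrative assumptions, not the released API.

```python
import torch

MASK_ID = 0        # hypothetical mask-token id
BLOCK_LEN = 32     # hypothetical block size
NUM_STEPS = 8      # denoising steps per block

@torch.no_grad()
def sample_block(model, prompt_ids: torch.Tensor) -> torch.Tensor:
    """Fill one block of masked tokens appended to `prompt_ids` by iterative
    Mask Token Prediction: predict every masked position, then keep
    ("unmask") the most confident predictions at each step."""
    device = prompt_ids.device
    block = torch.full((1, BLOCK_LEN), MASK_ID, dtype=torch.long, device=device)
    seq = torch.cat([prompt_ids, block], dim=1)

    for step in range(NUM_STEPS):
        masked = seq == MASK_ID
        if not masked.any():
            break
        logits = model(seq).logits                 # (1, T, vocab), assumed interface
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, -1.0)     # only rank masked positions

        # Unmask a fixed fraction of the remaining masks each step
        # (an adaptive schedule would vary this per step).
        k = max(1, int(masked.sum().item() / (NUM_STEPS - step)))
        top = conf.topk(k, dim=-1).indices
        seq.scatter_(1, top, pred.gather(1, top))
    return seq
```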
Capabilities span text-to-image generation, VQA, image editing, document understanding, and interleaved generation with reasoning. Includes SPRINT acceleration (KV-cache reuse plus adaptive unmasking) for efficient inference. Matches specialized VLMs on multimodal understanding while delivering strong image generation and editing. Released under Apache 2.0 by Inclusion AI (Ant Group AGI Research Center).
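The SPRINT idea can be sketched in the same spirit: encode the clean prefix once, reuse its cached key/value states across denoising steps, and commit more tokens per step when the model is confident (adaptive unmasking). The `use_cache`/`past_key_values` interface, mask id, and the 0.9 confidence threshold below are hypothetical placeholders under those assumptions, not the released implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id

@torch.no_grad()
def denoise_block_with_cached_prefix(model, prefix_ids, block_ids, num_steps=8):
    """Sketch of KV-cache reuse + adaptive unmasking: the prefix is encoded
    once, its key/value states are reused for every denoising step, and the
    number of tokens committed per step depends on model confidence."""
    out = model(prefix_ids, use_cache=True)        # encode clean prefix once
    prefix_kv = out.past_key_values                # frozen across steps

    seq = block_ids.clone()
    for _ in range(num_steps):
        masked = seq == MASK_ID
        if not masked.any():
            break
        logits = model(seq, past_key_values=prefix_kv).logits
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~masked, -1.0)

        # Adaptive unmasking: commit every prediction above the (assumed)
        # confidence threshold, at least one token per step.
        k = int((conf > 0.9).sum().clamp(min=1).item())
        top = conf.topk(min(k, int(masked.sum().item())), dim=-1).indices
        seq.scatter_(1, top, pred.gather(1, top))
    return seq
```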