Emu3.5
A large-scale native multimodal world model that predicts the next vision-language state, pre-trained end-to-end on over 10T interleaved vision-language tokens from internet videos. It supports long-horizon vision-language generation, any-to-image (X2I) synthesis, and text-rich image creation, and includes Discrete Diffusion Adaptation (DiDA), which speeds up per-image inference by roughly 20x. Its generation and editing performance is comparable to Gemini 2.5 Flash Image.
Model Details
Variants
| Name | Parameters | Notes |
|---|---|---|
| Emu3.5 | — | General-purpose multimodal model for interleaved image-text generation |
| Emu3.5-Image | — | Specialized for text-to-image (T2I) and any-to-image (X2I) tasks |
| Emu3.5-VisionTokenizer | — | Vision tokenizer for encoding images into discrete tokens |