Native multimodal model trained solely via next-token prediction, demonstrating that a single architecture can match task-specific methods across both generation and perception. Published in the main issue of Nature, making this the second Chinese large-model team (after DeepSeek) to achieve a main-issue publication, and China's first Nature paper on the multimodal large-model route.

Outputs (2)

Emu3: Next-Token Prediction is All You Need

paper

Research showing native multimodal models can be trained solely via next-token prediction.

arXiv: 2409.18869

Multimodal learning with next-token prediction (Nature)

paper

Emu3 research published in the main issue of Nature, demonstrating that multimodal models trained solely via next-token prediction can match task-specific methods across generation and perception. The work shows coherent, high-fidelity video generation, interleaved vision-language generation, and vision-language-action modelling for robotic manipulation.

arXiv: 2409.18869

Venue: Nature

Tags: multimodal, generation, architecture, open-weight
