Native multimodal model trained solely via next-token prediction, demonstrating that a single architecture can match task-specific methods across both generation and perception. Published in the main issue of Nature, making this the second Chinese large-model team (after DeepSeek) to achieve a main-issue publication, and China's first Nature paper on the multimodal large-model route.

Outputs (2)

Emu3: Next-Token Prediction is All You Need

paper

Research showing native multimodal models can be trained solely via next-token prediction.

arXiv: 2409.18869

Multimodal learning with next-token prediction (Nature)

paper

Emu3 research published in the main issue of Nature, demonstrating that multimodal models trained solely via next-token prediction can match task-specific methods across generation and perception. The work shows coherent, high-fidelity video generation, interleaved vision-language generation, and vision-language-action modelling for robotic manipulation.

arXiv: 2409.18869

Venue: Nature

Tags: multimodal, generation, architecture, open-weight
