Unified open-source omni-modal large language model for audio-visual multi-turn interaction. Available in 4B and 8B parameter variants, it integrates a vision encoder, audio encoder, LLM, and speech decoder into a single model for comprehensive understanding and generation tasks. Aims to lead the field of lightweight omni-modal models.

Outputs 2

InteractiveOmni Model

model

Variants

Name                 Parameters   Notes
InteractiveOmni-4B   4B
InteractiveOmni-8B   8B

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

paper

arXiv: 2510.13747

multimodal, audio, open-source