Qwen2.5-Omni-7B | Lab Index

An end-to-end multimodal model that processes text, images, audio, and video and generates speech in real time, built on a Thinker-Talker architecture. It drew over 80k downloads on HuggingFace in its first week.

Blog Post | HuggingFace | GitHub

Model Details

Architecture: Dense

Parameters: 7B

multimodal, audio, open-weight