End-to-end multimodal model that processes text, images, audio, and video and generates speech in real time, built on a Thinker-Talker architecture. Over 80k downloads on Hugging Face in its first week.

Model Details

Architecture: Dense
Parameters: 7B
multimodal, audio, open-weight