Omni-modal model (8B) capable of real-time speech-to-speech interaction and multimodal live streaming on mobile devices. Built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B. Presented as the first on-device model to reach GPT-4o-level performance across vision, speech, and live streaming.
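As a rough sanity check on the 8B figure, the nominal sizes embedded in the component names can be added up. This is only a sketch: the per-component counts are taken from the names themselves (an assumption, since actual checkpoint sizes may differ), connector or projection layers are ignored, and `COMPONENTS_M` and `total_params_billions` are hypothetical names introduced for illustration.

```python
# Nominal sizes in millions of parameters, read off the component names.
# Assumption: the name suffix reflects the parameter count; real checkpoints may differ.
COMPONENTS_M = {
    "SigLip-400M": 400,          # vision encoder
    "Whisper-medium-300M": 300,  # speech encoder
    "ChatTTS-200M": 200,         # speech decoder
    "Qwen2.5-7B": 7000,          # language model backbone
}


def total_params_billions(components_m):
    """Sum nominal component sizes (in millions) and return billions."""
    return sum(components_m.values()) / 1000


total_b = total_params_billions(COMPONENTS_M)
print(f"nominal total: {total_b:.1f}B")  # prints "nominal total: 7.9B"
```

The nominal sum of about 7.9B is consistent with the model being listed as 8B.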

Model Details

Architecture DENSE
Parameters 8B
multimodal, audio, on-device, agentic

Related