Ant Group's native multimodal model built on the Ling backbone. Handles vision, speech, audio, and music.

Paper

arXiv: 2506.09344

multimodal, audio, open-weight

Related