LongCat-Next
Native multimodal foundation model (74B-parameter MoE, ~3B active per token) that unifies text, vision, and audio understanding and generation via DiNA (Discrete Native Autoregression). Introduces dNaViT for dynamic visual tokenization, achieving 28x visual compression with strong text rendering. Supports image generation, TTS, voice cloning, and low-latency voice conversation.
Model Details
Architecture MoE
Parameters 74B
Active parameters ~3B
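The total-vs-active parameter gap above comes from sparse mixture-of-experts routing: a router scores all experts per token but only a top-k subset runs, so compute scales with active parameters (~3B) rather than the full 74B. The toy sketch below illustrates the mechanism only; the expert count, top-k, and dimensions are hypothetical and do not reflect LongCat-Next's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not LongCat-Next's real config).
n_experts, top_k, d = 8, 2, 16
router_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe_forward(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    logits = x @ router_w                 # one score per expert
    chosen = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()              # softmax over the chosen experts
    # Only the chosen experts' weights are touched -> sparse activation.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(d))
print(y.shape)  # (16,)
```

With top_k = 2 of 8 experts, only a quarter of the expert parameters participate per token, which is the same principle behind activating ~3B of 74B parameters.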