The first open-source 7B omni-modal large language model capable of concurrently processing and analyzing image, video, audio, and text. Trained with a two-stage schema: multimodal alignment followed by multitask fine-tuning. Developed by Baichuan in collaboration with Westlake University and Zhejiang University.
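The two-stage schema is only named above; the sketch below illustrates the general pattern such a schema typically follows (stage 1: align modality features to a frozen LLM by training lightweight projectors; stage 2: multitask fine-tuning with more components unfrozen). All module and function names here are hypothetical placeholders, not Baichuan-Omni's actual implementation, which is detailed in the technical report.

```python
import torch
import torch.nn as nn

# Hypothetical two-stage training skeleton (PyTorch). The real recipe,
# losses, and freezing strategy are described in the technical report;
# this only illustrates the alignment -> multitask fine-tuning pattern.

def train_stage(model: nn.Module, dataloader, trainable_modules, lr: float, num_steps: int):
    """Run one training stage, updating only the listed submodules."""
    for p in model.parameters():          # freeze everything first
        p.requires_grad_(False)
    params = []
    for module in trainable_modules:      # then unfreeze this stage's trainable parts
        for p in module.parameters():
            p.requires_grad_(True)
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(num_steps), dataloader):
        loss = model(**batch)             # assume the model returns a scalar LM loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Stage 1 -- multimodal alignment: train only the projectors that map image /
# video / audio encoder features into the LLM's embedding space.
#   train_stage(omni, alignment_loader, [omni.projectors], lr=1e-3, num_steps=...)
#
# Stage 2 -- multitask fine-tuning: unfreeze the LLM (and optionally the encoders)
# and train jointly on mixed image, video, audio, and text tasks.
#   train_stage(omni, multitask_loader, [omni.llm, omni.projectors], lr=2e-5, num_steps=...)
```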

Outputs (2)

Baichuan-Omni Technical Report (paper)

Technical report describing the omni-modal architecture, two-stage multimodal training schema, and evaluation results.

arXiv: 2410.08565

Baichuan-Omni (model)

Open-source 7B multimodal model for image, video, audio, and text understanding.

Architecture: Dense
Parameters: 7B
Tags: open-weight, multimodal, audio, vision