Universal Audio Tokenizer

A compact single-codebook audio tokenizer that unifies general audio perception and linguistic alignment for downstream Audio-LLMs. It combines a WhisperVQ encoder, a flow-based decoder, and a HiFi-T vocoder with a Semantic-Acoustic Equilibrium mechanism that adaptively injects fine-grained acoustic detail from shallow encoder layers into the deep semantic stream.
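The summary above does not say how the adaptive injection is computed. As a rough illustration only, one common way to blend a shallow feature stream into a deep one is a learned per-frame gate. The sketch below assumes that form; the function name, the gate parameterization, and the weights are all hypothetical and are not taken from the paper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_inject(deep, shallow, gate_w_deep, gate_w_shallow, gate_bias=0.0):
    """Hypothetical gated fusion: fused = deep + g * shallow, with g in (0, 1).

    deep, shallow: lists of frames, each frame a list of floats of equal length.
    gate_w_deep, gate_w_shallow: assumed gate weights, one per feature dimension.
    """
    fused = []
    for d_frame, s_frame in zip(deep, shallow):
        # One scalar gate per frame, computed from both streams (assumed form).
        logit = gate_bias
        logit += sum(w * d for w, d in zip(gate_w_deep, d_frame))
        logit += sum(w * s for w, s in zip(gate_w_shallow, s_frame))
        g = sigmoid(logit)
        # Add the gated shallow (acoustic) features onto the deep (semantic) ones.
        fused.append([d + g * s for d, s in zip(d_frame, s_frame)])
    return fused

deep = [[1.0, 2.0]]      # one frame of deep-layer features
shallow = [[2.0, 4.0]]   # matching frame of shallow-layer features
zero_w = [0.0, 0.0]      # zero weights and zero bias give g = 0.5

fused = gated_inject(deep, shallow, zero_w, zero_w)
closed = gated_inject(deep, shallow, zero_w, zero_w, gate_bias=-50.0)
```

With a strongly negative bias the gate closes and the output stays close to the deep stream. That is the behavior one would want on frames where extra acoustic detail would hurt linguistic alignment.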

25 Hz frame rate, 8,192-entry codebook, 325 bps (13 bits per token at 25 tokens per second). Designed as a unified audio I/O interface for Audio-LLMs: one tokenizer covers speech reconstruction, TTS synthesis, and general audio-event discrimination (speech, sound, music). The companion paper introduces UniAudio-Token, built on a Qwen2.5 LLM backbone.
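The bitrate follows from the frame rate and codebook size. A single codebook of 8,192 entries needs log2(8192) = 13 bits per token, and 25 tokens per second gives 325 bps. A minimal sketch of that arithmetic follows; the helper name is illustrative and not part of any released code.

```python
import math

def token_bitrate_bps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a discrete tokenizer, assuming uniform fixed-length codes."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * bits_per_token * num_codebooks

bitrate = token_bitrate_bps(25, 8192)   # 25 Hz, 8,192 entries, single codebook
tokens_for_10s = 25 * 10                # tokens emitted for 10 seconds of audio

print(bitrate)         # 325.0
print(tokens_for_10s)  # 250
```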

audio, speech, tokenizer, foundation-model
