Introduces Chain-of-Thought ASR (CoT-ASR), which enables LLMs to analyze speech input and generate contextual reasoning before producing a transcription. Uses a CTC-guided Modality Adapter to bridge the gap between speech and text representations.
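A minimal sketch of the general idea behind CTC-guided adaptation (not the paper's actual code; all function names here are hypothetical): a CTC head's per-frame argmax path can be collapsed into token spans, and frame-level speech features can then be pooled over those spans to produce roughly text-rate representations for an LLM.

```python
# Hypothetical illustration of CTC-guided feature compression.
# Assumes per-frame argmax labels from a CTC head; blank id is 0.

def ctc_guided_segments(frame_labels, blank=0):
    """Collapse a per-frame CTC argmax path into token spans.

    Returns a list of (label, start_frame, end_frame_exclusive)
    for each emitted (non-blank, de-duplicated) token.
    """
    segments = []
    prev = None
    for t, lab in enumerate(frame_labels):
        if lab != blank and lab != prev:
            segments.append([lab, t, t + 1])      # start a new token span
        elif lab != blank and lab == prev and segments:
            segments[-1][2] = t + 1               # extend the current span
        prev = lab
    return [tuple(s) for s in segments]


def pool_features(features, segments):
    """Mean-pool frame features over each CTC-derived token span,
    yielding one vector per token (text-rate representations)."""
    pooled = []
    for _, start, end in segments:
        span = features[start:end]
        dim = len(span[0])
        pooled.append([sum(f[d] for f in span) / len(span) for d in range(dim)])
    return pooled


# Example: path 0,2,2,0,3,3,3,0 emits two tokens (labels 2 and 3).
labels = [0, 2, 2, 0, 3, 3, 3, 0]
segs = ctc_guided_segments(labels)          # [(2, 1, 3), (3, 4, 7)]
feats = [[float(t)] for t in range(8)]
compressed = pool_features(feats, segs)     # [[1.5], [5.0]]
```

The downsampling step is the key point: the LLM then attends over a short token-aligned sequence rather than hundreds of acoustic frames.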

Achieves an 8.7% relative reduction in word error rate (WER) and a 16.9% relative reduction in entity error rate (EER) compared to standard LLM-based ASR. By Deng, Fan, Ren, Wang, and Li at Microsoft Core AI.

Paper

speech · reasoning · research