Grad-TTS is a diffusion probabilistic model for text-to-speech synthesis. Its score-based decoder produces mel-spectrograms by gradually transforming noise that is predicted by the encoder and aligned with the text input via Monotonic Alignment Search. The framework allows an explicit trade-off between sound quality and inference speed. In subjective human evaluation it is competitive with state-of-the-art TTS systems in Mean Opinion Score. It was one of the first applications of diffusion models to speech synthesis.
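
The quality/speed trade-off can be illustrated with a toy reverse-diffusion loop. The following is a minimal sketch, not the released implementation: it assumes a linear noise schedule with illustrative endpoint values, an identity noise covariance, and a first-order Euler solver for the reverse-time ODE. The trained score network is replaced by a hypothetical closed-form `toy_score` for a Gaussian "clean data" distribution around a made-up `target`; `mu` stands in for the aligned encoder output, and all names and shapes are assumptions for illustration.

```python
import numpy as np

# Linear noise schedule endpoints (illustrative values, assumed for this sketch).
BETA_0, BETA_1 = 0.05, 20.0


def beta(t):
    """Noise schedule beta_t, linear in t on [0, 1]."""
    return BETA_0 + (BETA_1 - BETA_0) * t


def beta_integral(t):
    """Integral of beta_s over s in [0, t]."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t ** 2


def toy_score(x, mu, t, target, sigma0=0.1):
    """Closed-form score when clean data ~ N(target, sigma0^2 I).

    Hypothetical stand-in for the trained score network.
    """
    decay = np.exp(-0.5 * beta_integral(t))
    mean = mu + decay * (target - mu)               # noised mean drifts toward mu
    var = 1.0 - (1.0 - sigma0 ** 2) * decay ** 2    # noised variance grows toward 1
    return -(x - mean) / var


def reverse_diffusion(mu, score_fn, n_steps, rng):
    """Euler integration of the reverse-time ODE from t=1 down to t=0."""
    h = 1.0 / n_steps
    x = mu + rng.standard_normal(mu.shape)  # start from noise centred on mu
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h             # midpoint of the current step
        dx = 0.5 * (mu - x - score_fn(x, mu, t)) * beta(t) * h
        x = x - dx
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.zeros((80, 40))       # hypothetical encoder output: 80 mel bins x 40 frames
    target = np.ones((80, 40))    # hypothetical clean mel-spectrogram

    def score_fn(x, m, t):
        return toy_score(x, m, t, target)

    # Fewer steps run faster; more steps follow the ODE more closely.
    for n_steps in (4, 10, 100):
        mel = reverse_diffusion(mu, score_fn, n_steps, rng)
        print(n_steps, float(np.abs(mel - target).mean()))
```

In this sketch `n_steps` is the only knob: each step costs one score evaluation, so the step count sets inference time directly, while a larger count reduces the discretization error of the solver.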

Outputs (2)

Grad-TTS (model)

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech (paper), arXiv: 2105.06337

audio, generation, open-source