Boosting diffusion model for spectrum upsampling in text-to-speech: An empirical study

arXiv: 2207.04646

Authors

Abstract

Scaling text-to-speech (TTS) with language models to large-scale, in-the-wild datasets has made great progress in capturing the diversity and expressiveness of human speech, such as speaker identities and prosody, but the quality of waveforms reconstructed from discrete speech tokens remains far from satisfactory and depends mainly on the compressed speech token size. Generative diffusion models trained with a score-matching loss and continuous normalizing flows trained with a flow-matching loss have become prominent in the generation of images as well as speech, especially in domains such as conversational speech. Current large TTS systems usually quantize speech into discrete tokens, use a language model to generate these tokens autoregressively, and then use a diffusion model to upsample the coarse-grained speech tokens into fine-grained codec features or mel-spectrograms before reconstructing the waveform; this pipeline has high latency and is unrealistic for real-time speech applications. In this paper, we systematically investigate varied diffusion models for the upsampling stage, which is the main bottleneck for streaming synthesis in language-model-and-diffusion-based architectures, and present the differences in model architecture and in objective and subjective measurements on the same dataset.
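To make the two training objectives mentioned above concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a score-matching (epsilon-prediction) loss and a conditional flow-matching loss for upsampling coarse speech-token embeddings into mel-spectrogram frames. The names model, mel, cond, and alpha_bar are assumptions for illustration.

import torch

def ddpm_loss(model, mel, cond, alpha_bar):
    """Score-matching (epsilon-prediction) objective, DDPM-style.
    mel:       (B, T, D) target mel-spectrogram frames
    cond:      coarse speech-token conditioning features
    alpha_bar: (N,) cumulative noise schedule."""
    B = mel.size(0)
    t = torch.randint(0, alpha_bar.numel(), (B,), device=mel.device)
    a = alpha_bar[t].view(B, 1, 1)
    eps = torch.randn_like(mel)
    x_t = a.sqrt() * mel + (1 - a).sqrt() * eps        # forward diffusion
    return torch.mean((model(x_t, t, cond) - eps) ** 2)

def flow_matching_loss(model, mel, cond):
    """Conditional flow-matching objective: regress the constant velocity
    of a straight path from Gaussian noise (t=0) to data (t=1)."""
    t = torch.rand(mel.size(0), 1, 1, device=mel.device)
    x0 = torch.randn_like(mel)
    x_t = (1 - t) * x0 + t * mel                       # linear probability path
    v_target = mel - x0                                # constant target velocity
    v_pred = model(x_t, t.view(-1), cond)
    return torch.mean((v_pred - v_target) ** 2)

Both losses reduce to a simple regression on the same network interface, which is what makes it possible to compare the variants in the samples below on one architecture.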

Audio Samples

[Table of audio players: each of DDPM, EDM, flow matching, reflow, multiflow, consistency distill, bespoke-step5, and bespoke-step8 is compared at 1, 6, and 12 sampling steps.]
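The step counts above are the number of solver iterations used at inference time. As a rough, hypothetical illustration (assuming a flow-matching-style velocity model; euler_sample and its arguments are not from the paper), lower step counts trade sample quality for latency:

import torch

@torch.no_grad()
def euler_sample(model, cond, shape, n_steps, device="cpu"):
    """Integrate the learned velocity field from noise (t=0) to data (t=1)
    with n_steps uniform Euler steps; n_steps = 1, 6, 12 in the table."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * model(x, t, cond)  # x_{t+dt} = x_t + dt * v_theta(x_t, t)
    return x

With n_steps=1 this collapses to a single network evaluation, the regime that reflow, consistency distillation, and bespoke solvers specifically target.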


Contact the author