arXiv:2207.04646
Scaling text-to-speech (TTS) with language models to large-scale, in-the-wild datasets has made great progress in capturing the diversity and expressiveness of human speech, such as speaker identity and prosody. However, the quality of waveforms reconstructed from discrete speech tokens remains far from satisfactory, depending mainly on the compression rate of the speech tokenizer. Generative diffusion models trained with a score-matching loss, and continuous normalizing flows trained with a flow-matching loss, have become prominent in both image and speech generation, especially for conversational speech. Current large TTS systems typically quantize speech into discrete tokens and use a language model to generate these tokens autoregressively, then use a diffusion model to upsample the coarse-grained speech tokens into fine-grained codec features or mel-spectrograms before reconstructing the waveform. This pipeline has high latency and is impractical for real-time speech applications. In this paper, we systematically investigate several diffusion models for the upsampling stage, which is the main bottleneck for streaming synthesis in language-model-plus-diffusion architectures, and report the model architectures together with objective and subjective evaluation differences on the same dataset.
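Several of the compared objectives reduce to regressing a velocity field along a prescribed probability path. As a point of reference, here is a minimal PyTorch sketch of a conditional flow-matching training loss; the `velocity_model` signature, the linear (rectified-flow-style) interpolation path, and the conditioning on coarse speech tokens are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_model: nn.Module,
                       x1: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow-matching loss (illustrative sketch).

    x1:   target fine-grained features, shape (batch, frames, dims)
    cond: coarse speech tokens or other conditioning
    """
    x0 = torch.randn_like(x1)                            # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-sample time in (0, 1)
    xt = (1.0 - t) * x0 + t * x1                         # straight-line interpolation
    target_v = x1 - x0                                   # constant target velocity
    pred_v = velocity_model(xt, t.view(-1), cond)        # model predicts velocity at (xt, t)
    return ((pred_v - target_v) ** 2).mean()             # MSE regression on velocity
```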
| Model | 1 step | 6 steps | 12 steps |
|---|---|---|---|
| DDPM | | | |
| EDM | | | |
| Flow matching | | | |
| Reflow | | | |
| Multiflow | | | |
| Consistency distillation | | | |
| Bespoke (5-step) | | | |
| Bespoke (8-step) | | | |
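The step columns above correspond to the number of solver steps (network evaluations) used at sampling time. As a hedged sketch of what varying that step count means, the snippet below integrates the learned ODE with a few Euler steps, reusing the hypothetical `velocity_model` from the training sketch; the uniform time grid is an assumption, and distilled or bespoke solvers would replace this schedule with their own.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def euler_sample(velocity_model: nn.Module,
                 cond: torch.Tensor,
                 shape: tuple,
                 steps: int = 6) -> torch.Tensor:
    """Integrate the learned ODE from noise (t=0) to data (t=1)
    in `steps` Euler updates; each step costs one forward pass."""
    x = torch.randn(shape, device=cond.device)  # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1, device=cond.device)
    for i in range(steps):
        t = ts[i].expand(shape[0])              # broadcast time to the batch
        dt = ts[i + 1] - ts[i]
        x = x + dt * velocity_model(x, t, cond) # Euler step along the flow
    return x
```

With this view, the 1-step column measures the fully amortized regime (one forward pass, lowest latency), while 6 and 12 steps trade latency for reconstruction quality.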