DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

arXiv: 2207.04646

Authors

Abstract

Current text-to-speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are trained separately instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal. To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct the speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of VQ-GAN), with an auxiliary loss on the acoustic model to predict the intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain of +0.14 over DelightfulTTS, and further method analyses verify the effectiveness of the developed system.
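To make the "vector quantized" part of the codec concrete, the following is a minimal sketch of the nearest-neighbour codebook lookup that vector quantization generally relies on. It is not the paper's implementation: the actual codec is a trained neural auto-encoder with adversarial losses, and the codebook values, the feature dimension, and the function names `squared_distance` and `quantize` are all hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: nearest-neighbour vector quantization of
# frame-level features. The codebook, dimensions, and names are assumptions
# made for this example, not details taken from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each frame vector to the index and value of its nearest codebook entry."""
    indices = []
    quantized = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy codebook with three 2-dimensional entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Three toy "frames" standing in for encoder outputs.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(frames, codebook)
print(indices)
print(quantized)
```

In a system of the kind the abstract describes, the quantized frame-level vectors would play the role that mel-spectrogram frames play in a cascaded pipeline: the acoustic model predicts them and the decoder reconstructs the waveform from them.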

Systems

Audio Samples

Copy-synthesis

Text I asked the RSC, and I thought they'd surely say no, but they didn't. One other point, the famous ethnic differences of Yugoslavia, they don't exist. Sure enough, the model is bait for Bob and spends much of the film terribly sad.
GT
Copy-synthesis

TTS

Text Less fit letters are not forwarded, and so die. It's very troubling to me and it should be very troubling to all Americans. The premiers of the provinces of Ontario and Quebec also stayed behind.
FastSpeech
DelightfulTTS1
DelightfulTTS2

