DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

Authors

Abstract

Most text-to-speech (TTS) methods use high-quality speech corpora recorded in well-designed environments, which incurs a high data-collection cost. To address this problem, existing noise-robust TTS methods train on noisy speech corpora. However, they handle only either time-variant or time-invariant noise. We propose a degradation-robust TTS method that can be trained on speech corpora containing both additive noise and environmental distortion. It jointly represents the time-variant additive noise with a frame-level encoder and the time-invariant environmental distortion with an utterance-level encoder. We also propose a regularization method to obtain a clean environmental embedding that is disentangled from utterance-dependent information such as linguistic content and speaker characteristics. Evaluation results show that our method achieves significantly higher-quality synthetic speech than previous methods under conditions containing both additive noise and reverberation.
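To illustrate the dual-encoder idea from the abstract, below is a minimal PyTorch sketch: a frame-level encoder produces a per-frame embedding for the time-variant additive noise, while an utterance-level encoder pools the whole utterance into a single time-invariant environment embedding. All module names, layer sizes, and the concrete `pair_consistency_loss` regularizer are illustrative assumptions for exposition, not the exact DRSpeech architecture or its regularization method.

```python
# Minimal sketch of the dual-encoder idea described in the abstract.
# Module sizes and the regularizer are illustrative assumptions,
# not the exact DRSpeech design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameLevelNoiseEncoder(nn.Module):
    """Maps a mel-spectrogram (B, T, n_mels) to per-frame noise embeddings,
    capturing time-variant additive noise."""
    def __init__(self, n_mels=80, d_embed=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, d_embed)
        )

    def forward(self, mel):               # (B, T, n_mels)
        return self.net(mel)              # (B, T, d_embed): one embedding per frame


class UtteranceLevelEnvEncoder(nn.Module):
    """Pools the whole utterance into one embedding, capturing
    time-invariant environmental distortion (e.g., reverberation)."""
    def __init__(self, n_mels=80, d_embed=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, d_embed)
        )

    def forward(self, mel):               # (B, T, n_mels)
        return self.net(mel).mean(dim=1)  # (B, d_embed): time-invariant


def pair_consistency_loss(env_a, env_b):
    """Hypothetical regularizer: embeddings of two clips that share the same
    environment (but differ in text/speaker) should coincide, discouraging the
    encoder from leaking utterance-dependent content into the embedding."""
    return F.mse_loss(env_a, env_b)


# Toy usage: condition a decoder input on both representations.
mel_a = torch.randn(2, 200, 80)  # two degraded utterances, same environment
mel_b = torch.randn(2, 180, 80)
noise_enc, env_enc = FrameLevelNoiseEncoder(), UtteranceLevelEnvEncoder()
frame_noise = noise_enc(mel_a)               # (2, 200, 128)
env = env_enc(mel_a)                         # (2, 128)
decoder_in = frame_noise + env.unsqueeze(1)  # broadcast env over all frames
reg = pair_consistency_loss(env, env_enc(mel_b))
```

Keeping the environment embedding strictly utterance-level (a single pooled vector) is what makes the disentanglement regularizer meaningful: a per-frame environment representation could trivially encode linguistic content, whereas one vector per utterance is pushed toward environment-only information.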

Audio samples

1) Clean condition

Speaker | Clean GT | Enhancement TTS [1] | Noise-robust TTS [2] | DRSpeech | DRSpeech w/o regularization
p232
p234
p304
p273

2) Noise condition

Speaker | Clean GT | Degraded GT | Enhancement TTS [1] | Noise-robust TTS [2] | DRSpeech | DRSpeech w/o regularization
p299
p243
p261
p286

3) Reverb condition

Speaker | Clean GT | Degraded GT | Enhancement TTS [1] | Noise-robust TTS [2] | DRSpeech | DRSpeech w/o regularization
p226
p266
p317
p311

4) Noise+Reverb condition

Speaker | Clean GT | Degraded GT | Enhancement TTS [1] | Noise-robust TTS [2] | DRSpeech | DRSpeech w/o regularization
p267
p256
p307
p279