Incremental text-to-speech synthesis using pseudo lookahead with large pretrained language model


Paper
arXiv preprint

Authors
Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari
(The University of Tokyo, Japan)

Abstract
Text-to-speech (TTS) synthesis, a technique for artificially generating human-like utterances from text, has evolved dramatically in recent years with advances in end-to-end deep neural network-based methods. Most of these methods perform sentence-level TTS, which can take into account time-series information across the whole sentence. However, low-latency synthesis usable in simultaneous speech-to-speech translation systems requires incremental TTS, which performs synthesis in smaller linguistic units. In general, incremental TTS is subject to a trade-off between the latency and the quality of the output speech. It is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future part of the sentence (hereafter, "lookahead"). This study proposes an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating human incremental reading; it uses a pretrained GPT-2, which captures large-scale linguistic knowledge, to generate the lookahead. Evaluation results show that our method 1) achieves higher speech quality than a method using only observed information, without increasing latency, and 2) reduces latency while achieving speech quality equivalent to that obtained by waiting for the future context to be observed.
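The control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `incremental_inputs`, `toy_lookahead`, `segment_size`, and `lookahead_size` are hypothetical, a small hand-written next-word table stands in for sampling from pretrained GPT-2, and the TTS model itself is omitted. The sketch only shows how each incoming segment could be paired with its observed past context and a predicted future context.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical interface: maps the observed words to predicted future words.
LookaheadFn = Callable[[List[str], int], List[str]]


def toy_lookahead(observed: List[str], num_words: int) -> List[str]:
    """Stand-in for GPT-2 sampling: a tiny hand-written next-word table."""
    next_word = {"the": "quick", "quick": "brown", "brown": "fox", "fox": "jumps"}
    predicted: List[str] = []
    last = observed[-1] if observed else ""
    for _ in range(num_words):
        if last not in next_word:
            break  # the toy table has no continuation for this word
        last = next_word[last]
        predicted.append(last)
    return predicted


def incremental_inputs(
    words: Iterable[str],
    lookahead_fn: LookaheadFn,
    segment_size: int = 1,
    lookahead_size: int = 2,
) -> Iterator[Dict[str, List[str]]]:
    """Yield past context, current segment, and pseudo lookahead per segment."""
    observed: List[str] = []
    segment: List[str] = []
    for word in words:
        segment.append(word)
        if len(segment) < segment_size:
            continue  # wait until the current segment is complete
        past = list(observed)
        observed.extend(segment)
        # The future context is predicted from observed text only, so the
        # synthesizer never waits for words that have not arrived yet.
        future = lookahead_fn(observed, lookahead_size)
        yield {"past": past, "current": segment, "lookahead": future}
        segment = []
    if segment:  # flush a trailing partial segment
        past = list(observed)
        observed.extend(segment)
        future = lookahead_fn(observed, lookahead_size)
        yield {"past": past, "current": segment, "lookahead": future}


steps = list(incremental_inputs(["the", "quick", "dog"], toy_lookahead))
for step in steps:
    print(step)
```

In this toy run the lookahead predicted after "the" is "quick brown", while the word that actually follows "quick" is "dog". The pseudo lookahead is a plausible continuation rather than the true future text, and a synthesizer conditioned on it would have to tolerate such mismatches.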


Utterance | Ground truth | Full-sentence | Bicontext (truth) | Independent | Unicontext | Bicontext | Bicontext (fine-tuned)
LJ001-0051
LJ017-0190
LJ050-0162
LJ029-0197
LJ016-0380
LJ009-0041
LJ050-0029
LJ012-0257
LJ023-0016
LJ034-0197


