Abstract
Restoring high-quality speech from degraded historical recordings is crucial for preserving cultural and endangered linguistic resources. A key challenge in this task is the scarcity of paired training data that replicates the original acoustic conditions of the historical audio. Previous approaches have used pseudo-paired data generated by applying various distortions to clean speech corpora, but such data cannot authentically simulate the acoustic variations found in historical recordings. In this paper, we propose a self-supervised approach to speech restoration that does not require paired corpora. Our model consists of three main modules: analysis, synthesis, and channel, which together emulate the recording process of degraded audio signals. Specifically, the analysis module disentangles undistorted speech features from distortion features, and the synthesis module generates the restored speech waveform. The channel module then reintroduces distortions into the restored waveform so that a reconstruction loss can be computed between the input and output degraded speech signals. We further improve our model with several techniques, including dual learning and a semi-supervised training framework. An additional feature of our model is audio effect transfer, which applies the acoustic distortions of a degraded audio signal to arbitrary audio signals. Experimental evaluations demonstrate that our method significantly outperforms a previous supervised method in restoring real historical speech resources.
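To make the pipeline concrete, the following is a minimal, self-contained sketch of the self-supervised reconstruction loop in PyTorch. The module internals (single convolutions) and all names are illustrative assumptions, not the architectures proposed in the paper; the sketch only shows how the analysis, synthesis, and channel modules compose so that the degraded input itself can serve as the reconstruction target, with no clean reference required.

```python
# Minimal sketch of the analysis-synthesis-channel reconstruction loop.
# Module internals are placeholder convolutions, not the paper's networks.
import torch
import torch.nn as nn

class Analysis(nn.Module):
    """Disentangles a degraded waveform into speech and distortion features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Conv1d(1, 2 * feat_dim, kernel_size=9, padding=4)
        self.feat_dim = feat_dim

    def forward(self, degraded):                     # degraded: (B, 1, T)
        h = self.encoder(degraded)
        speech_feat, dist_feat = h.split(self.feat_dim, dim=1)
        return speech_feat, dist_feat

class Synthesis(nn.Module):
    """Generates a restored waveform from the speech features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.decoder = nn.Conv1d(feat_dim, 1, kernel_size=9, padding=4)

    def forward(self, speech_feat):
        return torch.tanh(self.decoder(speech_feat))  # (B, 1, T)

class Channel(nn.Module):
    """Re-applies the estimated distortion so the output matches the input domain."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.distort = nn.Conv1d(1 + feat_dim, 1, kernel_size=9, padding=4)

    def forward(self, restored, dist_feat):
        return self.distort(torch.cat([restored, dist_feat], dim=1))

analysis, synthesis, channel = Analysis(), Synthesis(), Channel()
degraded = torch.randn(4, 1, 16000)                  # batch of 1-second 16 kHz clips

speech_feat, dist_feat = analysis(degraded)          # 1) disentangle
restored = synthesis(speech_feat)                    # 2) restore clean waveform
redegraded = channel(restored, dist_feat)            # 3) re-apply distortion

# Self-supervised reconstruction loss: the target is the degraded input itself,
# so no paired clean/degraded corpus is needed.
loss = nn.functional.l1_loss(redegraded, degraded)
loss.backward()
```

Note that at inference time only the analysis and synthesis modules are needed to produce the restored waveform; applying the channel module to a different input instead yields the audio effect transfer described above.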