SPS
Text-to-speech (TTS) synthesis is often applied as a data augmentation approach for automatic speech recognition (ASR), leveraging additional text for ASR training. However, in low-resource datasets only a limited number of speakers are available, leading to a lack of speaker variation in the synthetic speech. In this paper, we propose a speaker augmentation approach that synthesizes data with sufficient speaker diversity. We train our TTS system conditioned on speaker representations from a variational autoencoder (VAE), which enables the TTS system to synthesize speech from unseen speakers by sampling from the latent distribution. The augmented data is then used for ASR training. In our experiments, we first assume that only 5 hours of data are available. Our approach reduces WER by a relative 6.5% and 7.7% on the two test sets over the baseline system. We then find that ASR still benefits from our approach when it is combined with SpecAugment, especially when more real data is available. We also explore how our approach performs as the amount of text for speech synthesis increases. Combined with SpecAugment, our approach obtains relative WER reductions of 19.3% and 17.8% on the two test sets compared with applying SpecAugment alone.