    Length: 14:57
04 May 2020

Text-to-speech synthesis (TTS) is often applied as a data-augmentation approach for automatic speech recognition (ASR), leveraging additional text for ASR training. However, in low-resource datasets only a limited number of speakers are available, leading to a lack of speaker variation in the synthetic speech. In this paper, we propose a speaker augmentation approach that synthesizes data with sufficient speaker diversity. We condition our TTS system on speaker representations from a variational autoencoder (VAE), which enables the TTS system to synthesize speech for unseen speakers by sampling from the latent distribution. The augmented data is then used for ASR training. We first assume that only a 5-hour dataset is available in our experiments. Our approach reduces WER by a relative 6.5% and 7.7% on the two test sets over the baseline system. We then find that ASR still benefits from our approach when combined with SpecAugment, especially when more real data is available. We also explore how our approach performs as the amount of text for speech synthesis increases. Combined with SpecAugment, our approach obtains relative WER reductions of 19.3% and 17.8% on the two test sets compared with applying SpecAugment alone.
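The core sampling step described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, latent dimensionality, and the standard-normal prior N(0, I) are assumptions (a standard choice for VAE priors), and the TTS model that would consume these embeddings is omitted.

```python
import numpy as np

def sample_speaker_embeddings(n_speakers: int, latent_dim: int, seed=None):
    """Draw speaker latent vectors from an assumed VAE prior N(0, I).

    At synthesis time, each sampled vector would condition the TTS model
    as if it were a new, unseen speaker, increasing speaker diversity
    in the augmented training data.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_speakers, latent_dim))

# e.g. 100 synthetic "speakers" with a hypothetical 16-dim latent space
embeddings = sample_speaker_embeddings(100, 16, seed=0)
```

Each row of `embeddings` would then be paired with unseen text and fed to the VAE-conditioned TTS system to produce one synthetic utterance per sampled speaker.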
