
Improving Speech Recognition Using Consistent Predictions On Synthesized Speech

Andrew Rosenberg, Gary Wang, Zhehuai Chen, Yu Zhang, Bhuvana Ramabhadran, Yonghui Wu, Pedro Moreno

04 May 2020

Speech synthesis has advanced to the point of being close to indistinguishable from human speech. However, efforts to train speech recognition systems on synthesized utterances have not shown that synthesized data can effectively augment or replace human speech. In this work, we demonstrate that promoting consistent predictions in response to real and synthesized speech enables significantly improved speech recognition performance. We also find that a model trained on 460 hours of LibriSpeech audio augmented with 500 hours of transcripts (without audio) performs within 0.2% WER of a system trained on 960 hours of transcribed audio. This suggests that, when sufficient text is available, this approach can cut reliance on transcribed audio nearly in half.
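The abstract does not spell out the training objective, but a prediction-consistency term between real and synthesized renderings of the same transcript can be sketched as follows. In this minimal Python/PyTorch sketch, the `model`, its teacher-forced call signature, and the choice of KL direction are illustrative assumptions, not the authors' exact formulation.

import torch.nn.functional as F

def consistency_loss(model, real_audio, synth_audio, transcript):
    # Hypothetical attention-based ASR model: called with teacher forcing on
    # `transcript`, it returns per-token logits of shape (batch, tokens, vocab).
    # Teacher forcing aligns the two prediction sequences, since both are
    # indexed by transcript position rather than by audio frame.
    logits_real = model(real_audio, transcript)
    logits_synth = model(synth_audio, transcript)
    # KL divergence pulling the synthesized-speech predictions toward the
    # real-speech predictions, averaged over the batch. Gradients flow
    # through both branches in this sketch; stopping gradients on the
    # real branch is a common variant.
    return F.kl_div(
        F.log_softmax(logits_synth, dim=-1),
        F.softmax(logits_real, dim=-1),
        reduction="batchmean",
    )

In practice a term like this would be added, with a tunable weight, to the standard ASR loss on the paired transcripts, so that the model is penalized whenever real and synthesized versions of the same utterance yield diverging predictions.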
