04 May 2020

In this paper we present an end-to-end speech recognition system that can recognize single-channel speech in which multiple talkers speak at the same time (overlapping speech), using a neural network model based on the Recurrent Neural Network Transducer (RNN-T) architecture. We augment the conventional RNN-T architecture with a masking model that separates the encoded audio features, and with multiple label encoders that encode the transcripts of different speakers. We use a masking L2 loss to prevent transcripts from aligning to the wrong speaker's audio, and a speaker embedding loss to facilitate speaker tracking. We show that with these additional training objectives, the proposed augmented RNN-T model can be trained on simulated overlapping speech data and achieves a WER of 32% on words in overlapping speech segments from real-life telephone conversations. Our analysis of a manual transcription task on the same test set shows that transcribing overlapping speech is hard even for humans, who achieve a WER of 37% against the ground truth.
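The abstract describes a masking model that separates shared encoder features per speaker, trained with an L2 loss against per-speaker targets so each mask stays aligned to the right speaker. The following is a minimal numpy sketch of that idea; the function name, the element-wise masking, and the use of per-speaker reference encodings as targets are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def mask_and_l2_loss(encoded, masks, speaker_refs):
    """Hypothetical sketch of the masking step and masking L2 loss.

    encoded:      (T, D) shared encoder features of the overlapping mixture
    masks:        list of (T, D) per-speaker masks from the masking model
    speaker_refs: list of (T, D) per-speaker reference encodings (targets)

    Returns the per-speaker separated features and the summed L2 loss.
    """
    separated, total_loss = [], 0.0
    for mask, ref in zip(masks, speaker_refs):
        feats = mask * encoded                 # element-wise masking of shared features
        separated.append(feats)
        total_loss += np.mean((feats - ref) ** 2)  # L2 penalty tying mask to its speaker
    return separated, float(total_loss)
```

In a full system, each separated feature stream would feed the joint network together with its speaker's label encoder output; this sketch only shows the separation and auxiliary loss.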
