
Cleanformer: A Multichannel Array Configuration-Invariant Neural Enhancement Frontend for ASR in Smart Speakers

Joseph P Caroselli (Google); Arun Narayanan (Google Inc.); Nathan Howard (Google); Tom O'Malley (Google)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
07 Jun 2023

This work introduces Cleanformer, a streaming multichannel neural enhancement frontend for automatic speech recognition (ASR). The model has a Conformer-based architecture that takes as inputs a single channel each of raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner. The time-frequency mask is applied to the noisy input to produce enhanced features for ASR. Detailed evaluations with speech- and non-speech-based noise show a significant reduction in word error rate (WER), about 80% at -6 dB SNR, compared with a state-of-the-art ASR model alone. Cleanformer also significantly outperforms enhancement using a beamformer with ideal steering. The enhancement model can be used with different microphone arrays without the need for retraining.
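
The masking step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the feature domain or the exact form of the mask, so this example assumes non-negative magnitude features and a mask with values in [0, 1]; the function name `apply_tf_mask` and the toy arrays are hypothetical and stand in for the output of the Conformer model, which is not reproduced here.

```python
import numpy as np


def apply_tf_mask(noisy_features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask element-wise.

    noisy_features: non-negative magnitudes, shape (frames, bins) (assumed domain).
    mask: per-bin gains, same shape; clipped to [0, 1] (assumed range).
    """
    if noisy_features.shape != mask.shape:
        raise ValueError("features and mask must have the same shape")
    return noisy_features * np.clip(mask, 0.0, 1.0)


# Toy input: 4 frames x 3 frequency bins of noisy magnitudes.
noisy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0],
                  [1.0, 1.0, 1.0]])

# Hypothetical mask; in Cleanformer it would be predicted by the model.
mask = np.full_like(noisy, 0.5)

enhanced = apply_tf_mask(noisy, mask)
```

Each time-frequency bin is scaled independently, so bins the mask marks as noise-dominated are attenuated while speech-dominated bins pass through largely unchanged.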
