Conditional Conformer: Improving Speaker Modulation for Single and Multi-User Speech Enhancement
Tom O'Malley (Google); Shaojin Ding (Google); Arun Narayanan (Google); Quan Wang (Google); Rajeev Rikhye (Google); Qiao Liang (Google); Yanzhang He (Google); Ian McGraw
Recently, Feature-wise Linear Modulation (FiLM) has been shown to outperform other approaches for incorporating speaker embeddings into speech separation and VoiceFilter models. We propose an improved method of incorporating such embeddings into a VoiceFilter frontend for automatic speech recognition (ASR) and text-independent speaker verification (TI-SV). We extend the widely-used Conformer architecture to construct a FiLM Block with additional feature processing before and after the FiLM layers. Apart from its application to single-user VoiceFilter, we show that our system can be easily extended to multi-user VoiceFilter models via element-wise max pooling of the speaker embeddings in a projected space. The final architecture, which we call Conditional Conformer, tightly integrates the speaker embeddings into a Conformer backbone. We improve TI-SV equal error rates by as much as 56% over prior multi-user VoiceFilter models, and our element-wise max pooling reduces relative WER by as much as 10% compared to an attention mechanism.
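The abstract names two conditioning mechanisms without implementation detail: FiLM modulation of features by a speaker embedding, and element-wise max pooling of multiple speaker embeddings in a projected space. The numpy sketch below illustrates both under stated assumptions; it is not the paper's implementation. All names (FiLMBlock, pool_speakers, w_proj) and all dimensions are illustrative, and the paper's actual FiLM Block sits inside a Conformer with additional feature processing before and after the FiLM layers, which this sketch omits.

import numpy as np


def film(x, gamma, beta):
    # FiLM layer: per-channel affine modulation, gamma * x + beta.
    return gamma * x + beta


class FiLMBlock:
    # Illustrative (not the paper's) FiLM Block: project a conditioning
    # vector to per-channel scale (gamma) and shift (beta), then modulate
    # the acoustic features.
    def __init__(self, embed_dim, feature_dim, rng):
        # Linear maps from the conditioning vector to FiLM parameters.
        self.w_gamma = 0.02 * rng.standard_normal((embed_dim, feature_dim))
        self.w_beta = 0.02 * rng.standard_normal((embed_dim, feature_dim))

    def __call__(self, features, cond):
        # features: (time, feature_dim); cond: (embed_dim,)
        gamma = cond @ self.w_gamma   # per-channel scale
        beta = cond @ self.w_beta     # per-channel shift
        return film(features, gamma, beta)


def pool_speakers(embeds, w_proj):
    # Element-wise max pooling of speaker embeddings in a projected space:
    # yields one conditioning vector regardless of how many users are
    # enrolled, so the single- and multi-user cases share one code path.
    projected = embeds @ w_proj       # (num_speakers, proj_dim)
    return projected.max(axis=0)      # (proj_dim,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((100, 256))     # 100 frames of 256-dim features
    speakers = rng.standard_normal((3, 128))    # embeddings for 3 enrolled users
    w_proj = 0.02 * rng.standard_normal((128, 128))
    cond = pool_speakers(speakers, w_proj)      # single conditioning vector
    block = FiLMBlock(embed_dim=128, feature_dim=256, rng=rng)
    out = block(feats, cond)                    # speaker-modulated features
    print(out.shape)                            # (100, 256)

Note how the max pooling makes the conditioning vector invariant to the number and ordering of enrolled speakers, which is what lets a single model serve both the single-user and multi-user VoiceFilter settings.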