Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

Tsubasa Ochiai (NTT); Marc Delcroix (NTT); Tomohiro Nakatani (NTT Communication Science Laboratories); Shoko Araki (NTT Corporation)

09 Jun 2023

Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in practice, which causes performance degradation. In this paper, we address the problem of mask-based beamforming for moving sources. We first review classical approaches to tracking a moving source, which perform online or blockwise computation of the SCMs. We show that these approaches can be interpreted as computing a sum of instantaneous SCMs weighted by attention weights. These weights indicate which time frames of the signal to consider in the SCM computation. Online or blockwise computation assumes a heuristic and deterministic way of computing these attention weights that, although simple, may not result in optimal performance. We thus introduce a learning-based framework that computes optimal attention weights for beamforming. We achieve this using a neural network implemented with self-attention layers. We show experimentally that our proposed framework can greatly improve beamforming performance in moving source situations while maintaining high performance in non-moving situations, thus enabling the development of mask-based beamformers robust to source movements.
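
The interpretation described above, in which online and blockwise SCM estimates are sums of instantaneous SCMs weighted by attention weights, can be illustrated with a minimal NumPy sketch. The function names, the single-frequency-bin data layout, the row normalization of the weights, and the forgetting factor alpha are all assumptions made for illustration; this is not the authors' implementation, and the learned self-attention weights proposed in the paper are not reproduced here.

```python
import numpy as np

def instantaneous_scms(x, mask):
    """Masked rank-1 SCM per frame for one frequency bin.

    x:    (T, C) complex STFT coefficients (T frames, C microphones)
    mask: (T,) time-frequency mask values in [0, 1]
    Returns psi with shape (T, C, C), psi[t] = mask[t] * x[t] x[t]^H.
    """
    return mask[:, None, None] * np.einsum("tc,td->tcd", x, x.conj())

def batch_weights(T):
    # Static-source assumption: every frame contributes equally to every estimate.
    return np.full((T, T), 1.0 / T)

def blockwise_weights(T, block):
    # Frame t attends uniformly to the frames that share its block.
    idx = np.arange(T) // block
    same = (idx[:, None] == idx[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def online_weights(T, alpha):
    # Recursive update Phi_t = alpha * Phi_{t-1} + (1 - alpha) * psi_t,
    # unrolled into causal, exponentially decaying weights.
    t = np.arange(T)
    lag = t[:, None] - t[None, :]
    decay = (1.0 - alpha) * alpha ** np.clip(lag, 0, None)
    return np.where(lag >= 0, decay, 0.0)

def weighted_scms(weights, psi):
    # Time-varying SCM: Phi[t] = sum over s of weights[t, s] * psi[s].
    return np.einsum("ts,scd->tcd", weights, psi)
```

In this sketch each row of the (T, T) weight matrix plays the role of the attention weights for one output frame. The three heuristics differ only in how that matrix is filled, and a learning-based approach would replace the fixed matrix with one predicted from the signal.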

Tags: Image, Video, and Multidimensional Signal Processing

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
