REAL-TIME TARGET SOUND EXTRACTION

Bandhav Veluri (University of Washington); Justin Chan (University of Washington); Malek Itani (University of Washington); Tuochao Chen (University of Washington); Takuya Yoshioka (Microsoft); Shyamnath Gollakota (University of Washington)

07 Jun 2023

We present the first neural network model to achieve real-time, streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture that uses a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions to cover large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show SI-SNRi improvements of 2.2–3.3 dB over prior models for this task, with a 1.2–4x smaller model size and a 1.5–2x lower runtime. Code, dataset, and audio samples are available at https://waveformer.cs.washington.edu/.

Tags: Audio signal enhancement and restoration

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
