REAL-TIME AUDIO-VISUAL END-TO-END SPEECH ENHANCEMENT

Zirun Zhu (Microsoft); Hemin Yang (Microsoft); Min Tang (Microsoft); Ziyi Yang (Microsoft); Sefik Emre Eskimez (Microsoft); Huaming Wang (Microsoft)

07 Jun 2023

Audio-visual speech enhancement (AV-SE) methods use auxiliary visual cues to enhance speakers' voices. In principle, they should therefore outperform audio-only speech enhancement (SE) methods. However, few works in the literature describe an AV-SE system that can run in real time on a CPU. In this paper, we propose a low-latency, real-time audio-visual end-to-end enhancement (AV-E3Net) model based on the recently proposed end-to-end enhancement network (E3Net). Our main contributions are twofold: 1) We employ a dense connection module to address the performance degradation caused by the deep model structure. This module significantly improves the model's performance on the AV-SE task. 2) We propose a multi-stage gating-and-summation (GS) fusion module to merge audio and visual cues. Our results show that the proposed model provides better perceptual quality and intelligibility than the baseline E3Net model with a negligible increase in computational cost.
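The abstract names the two modules but gives no equations, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes that "gating" means scaling audio features by a sigmoid mask derived from the visual features, that "summation" means adding a projection of the visual features, and that "dense connection" means each block receives the concatenation of the input and all earlier block outputs. Every name, shape, and weight here (`gs_fusion`, `dense_stack`, `w_gate`, `w_sum`, the `tanh` blocks) is hypothetical, and only a single fusion stage is shown.

```python
import math
import random


def matmul(x, w):
    """Multiply a (T, D_in) list-of-lists by a (D_in, D_out) list-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def gs_fusion(audio, visual, w_gate, w_sum):
    """Assumed gating-and-summation: audio * sigmoid(visual @ w_gate) + visual @ w_sum.

    audio: (T, Da), visual: (T, Dv), w_gate and w_sum: (Dv, Da).
    """
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(visual, w_gate)]
    summed = matmul(visual, w_sum)
    return [
        [a * g + s for a, g, s in zip(a_row, g_row, s_row)]
        for a_row, g_row, s_row in zip(audio, gate, summed)
    ]


def dense_stack(x, weights):
    """Assumed dense connection: block i sees the input plus all earlier outputs."""
    feats = [x]
    for w in weights:
        # Concatenate all previous feature maps along the feature axis, per frame.
        inp = [sum(rows, []) for rows in zip(*feats)]
        feats.append([[math.tanh(v) for v in row] for row in matmul(inp, w)])
    return feats[-1]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, Da, Dv = 5, 8, 4  # frames, audio feature size, visual feature size (illustrative)
audio = rand_matrix(T, Da, rng)
visual = rand_matrix(T, Dv, rng)

fused = gs_fusion(audio, visual, rand_matrix(Dv, Da, rng), rand_matrix(Dv, Da, rng))
# Block i takes Da * (i + 1) input features because of the concatenation.
weights = [rand_matrix(Da * (i + 1), Da, rng) for i in range(3)]
out = dense_stack(fused, weights)
print(len(out), len(out[0]))
```

In this sketch the fusion is computed frame by frame, so it adds no look-ahead; whether the actual model is structured this way is not stated in the abstract.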

Tags:

Machine/deep learning methodologies for multimedia

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
