AV-SepFormer: Cross-attention SepFormer for Audio-Visual Target Speaker Extraction

Jiuxin Lin (Tsinghua University); Xinyu Cai (Tsinghua University); Heinrich Dinkel (Xiaomi); Jun Chen (Tsinghua University); Zhiyong Yan (Xiaomi); Yongqing Wang (Xiaomi); Junbo Zhang (Xiaomi); Zhiyong Wu (Tsinghua University); Yujun Wang (Xiaomi); Helen Meng (The Chinese University of Hong Kong)

07 Jun 2023

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a dual-scale attention model based on SepFormer that utilizes cross- and self-attention to fuse and model features from the audio and visual modalities. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature; self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks and provides significant gains over traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio feature is synchronized to the visual feature, which alleviates the harm caused by the mismatch between audio and video sampling rates; and by combining self- and cross-attention, the feature fusion and speech extraction processes are unified within a single attention paradigm. Experimental results show that AV-SepFormer significantly outperforms existing methods.
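As a rough illustration of the chunking, 2D positional encoding, and attention-based fusion described above, the following PyTorch sketch splits an encoded audio sequence into as many chunks as there are visual frames, adds inter-chunk and intra-chunk sinusoidal encodings, and fuses with self- and cross-attention. This is not the authors' implementation; all module names, shapes, and hyperparameters (d_model, n_heads, the simple chunk trimming) are assumptions, and the cross-attention here lets every audio position attend to all visual frames rather than reproducing the paper's exact per-chunk scheme.

```python
# Minimal sketch (assumed shapes/names, not the authors' code) of
# chunk-synchronized audio-visual fusion with a 2D positional encoding.
import torch
import torch.nn as nn


def sinusoidal_pe(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / torch.pow(torch.tensor(10000.0), i / dim)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe


class ChunkedAVFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        """
        audio:  (B, T_a, D) encoded audio feature
        visual: (B, T_v, D) encoded visual feature (one vector per video frame)
        The audio sequence is split into T_v chunks so that each chunk is
        time-aligned with one visual frame.
        """
        B, T_a, D = audio.shape
        T_v = visual.shape[1]
        chunk = T_a // T_v                      # intra-chunk length
        audio = audio[:, : chunk * T_v]         # trim remainder for simplicity

        # 2D positional encoding: inter-chunk index plus intra-chunk index
        pe_inter = sinusoidal_pe(T_v, D).repeat_interleave(chunk, dim=0)
        pe_intra = sinusoidal_pe(chunk, D).repeat(T_v, 1)
        audio = audio + (pe_inter + pe_intra).to(audio.device)

        # Self-attention models the audio sequence
        audio, _ = self.self_attn(audio, audio, audio)

        # Cross-attention fuses visual cues into the audio representation
        fused, _ = self.cross_attn(audio, visual, visual)
        return fused


if __name__ == "__main__":
    # Example: 16 kHz-rate audio frames vs. 25 fps video gives many audio
    # steps per visual frame; here 800 audio steps and 50 visual frames.
    fusion = ChunkedAVFusion(d_model=256, n_heads=8)
    out = fusion(torch.randn(2, 800, 256), torch.randn(2, 50, 256))
    print(out.shape)  # torch.Size([2, 800, 256])
```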
