Cross-Modal Audio-Visual Co-learning for Text-independent Speaker Verification
Meng Liu (Tianjin University); Kong Aik Lee (Institute for Infocomm Research, ASTAR); Longbiao Wang (Tianjin University); Hanyi Zhang (Tianjin University); Chang Zeng (National Institute of Informatics); Jianwu Dang (School of Computer Science and Technology, Tianjin University, Tianjin, China; School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan)
SPS
Visual speech (i.e., lip motion) is strongly correlated with auditory speech due to their co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge exploited from the other. Specifically, two cross-modal boosters are introduced on top of an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the test scenarios show that our proposed method achieves around 60% and 20% average relative performance improvement over the baseline unimodal and fusion systems, respectively.
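The abstract does not detail how the max-feature-map (MFM) operation is embedded in the Transformer variant. As a point of reference only, the standard MFM activation (popularized by Light CNN) splits a feature vector into two halves and takes their element-wise maximum, halving the dimensionality. A minimal sketch, assuming features are split along the last axis:

```python
import numpy as np

def max_feature_map(x):
    """Max-Feature-Map (MFM) activation.

    Splits the input into two halves along the last axis and
    returns their element-wise maximum, so the output has half
    the feature dimension of the input. This is the standard MFM
    formulation; how it is embedded in the paper's Transformer
    variant is an assumption not specified in the abstract.
    """
    half = x.shape[-1] // 2
    a, b = x[..., :half], x[..., half:]
    return np.maximum(a, b)

# Example: a single 4-dimensional feature vector.
x = np.array([[1.0, -2.0, 3.0, 4.0]])
print(max_feature_map(x))  # → [[3. 4.]]
```

The competitive (max) selection acts as an implicit feature selector, which is one reason MFM is often preferred over ReLU in speaker-verification front-ends.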