THE DKU AUDIO-VISUAL WAKE WORD SPOTTING SYSTEM FOR THE 2021 MISP CHALLENGE

Ming Cheng, Haoxu Wang, Ming Li, Yechen Wang

Length: 00:08:13

07 May 2022

This paper describes the system developed by the DKU team for the MISP Challenge 2021. We present a two-stage approach consisting of end-to-end neural networks for the audio-visual wake word spotting task. First, we process audio and video data to give them a similar structure and then train two unimodal models separately with a unified network architecture. Second, we propose a Hierarchical Modality Aggregation (HMA) module that fuses multi-scale audio-visual information from the pre-trained unimodal models. Our system has a clear and concise framework consisting of end-to-end neural networks. With this framework and extensive data augmentation, our system achieves a false reject rate of 3.85% and a false alarm rate of 3.42% on far-field audio in the development set of the competition database, and it ranks 2nd in the wake word spotting track of the MISP challenge.
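
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how a false reject rate and a false alarm rate are conventionally computed from binary wake word decisions. It is not the challenge's official scoring script; the function name `false_reject_and_alarm_rates` and the toy labels are assumptions made purely for illustration.

```python
def false_reject_and_alarm_rates(labels, predictions):
    """Return (false reject rate, false alarm rate) for binary decisions.

    labels:      1 if the wake word is truly present, else 0
    predictions: 1 if the system fired, else 0
    (Hypothetical helper for illustration, not the challenge's scorer.)
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    # False reject: wake word present, but the system did not fire.
    false_rejects = sum(
        1 for y, p in zip(labels, predictions) if y == 1 and p == 0
    )
    # False alarm: wake word absent, but the system fired.
    false_alarms = sum(
        1 for y, p in zip(labels, predictions) if y == 0 and p == 1
    )
    return false_rejects / positives, false_alarms / negatives


# Toy data (invented): 4 positive samples, 5 negative samples.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1]
frr, far = false_reject_and_alarm_rates(toy_labels, toy_predictions)
print(f"FRR = {frr:.2%}, FAR = {far:.2%}")
```

In the toy data, one of four positives is missed and one of five negatives triggers the detector, so the two rates are normalized by different sample counts.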

Tags:

deep neural network

multimodal fusion

misp challenge

audio-visual wake word spotting
