THE DKU AUDIO-VISUAL WAKE WORD SPOTTING SYSTEM FOR THE 2021 MISP CHALLENGE
Ming Cheng, Haoxu Wang, Ming Li, Yechen Wang
This paper describes the system developed by the DKU team for the MISP Challenge 2021. We present a two-stage approach consisting of end-to-end neural networks for the audio-visual wake word spotting task. First, we process the audio and video data to give them a similar structure and train two unimodal models with a unified network architecture separately. Second, we propose a Hierarchical Modality Aggregation (HMA) module that fuses multi-scale audio-visual information from the pre-trained unimodal models. Our system has a clear and concise framework consisting of end-to-end neural networks. With this framework and extensive data augmentation, our system achieves a false reject rate of 3.85% and a false alarm rate of 3.42% on far-field audio in the development set of the competition database, which ranks 2nd in the wake word spotting track of the MISP Challenge.
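As a rough illustration only (not part of the original paper), the multi-scale fusion idea behind an HMA-style module could be sketched as follows. All module names, feature dimensions, and the pooling strategy here are assumptions, since the abstract does not specify implementation details; the actual design in the paper may differ.

import torch
import torch.nn as nn

class HierarchicalModalityAggregation(nn.Module):
    """Illustrative sketch: fuse per-scale audio and visual features.

    Assumes each pre-trained unimodal encoder exposes intermediate feature
    maps at several scales (hypothetical interface, not the paper's code).
    """

    def __init__(self, dims, fused_dim=128):
        super().__init__()
        # One projection per scale and per modality (assumed design choice).
        self.audio_proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.video_proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.classifier = nn.Linear(fused_dim, 1)  # wake word present / absent

    def forward(self, audio_feats, video_feats):
        # audio_feats / video_feats: lists of (batch, time, dim) tensors,
        # one per scale, taken from the frozen unimodal models.
        fused = 0
        for a, v, pa, pv in zip(audio_feats, video_feats,
                                self.audio_proj, self.video_proj):
            # Temporal average pooling per scale, then additive fusion.
            fused = fused + pa(a.mean(dim=1)) + pv(v.mean(dim=1))
        return torch.sigmoid(self.classifier(fused))

# Toy usage with two feature scales per modality.
hma = HierarchicalModalityAggregation(dims=[64, 256])
audio = [torch.randn(4, 100, 64), torch.randn(4, 25, 256)]
video = [torch.randn(4, 30, 64), torch.randn(4, 8, 256)]
print(hma(audio, video).shape)  # torch.Size([4, 1])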