ROBUST AUDIO-VISUAL MANDARIN SPEECH RECOGNITION BASED ON ADAPTIVE DECISION FUSION AND TONE FEATURES
Hong Liu, Zhengyan Chen, Wei Shi
SPS
Audio-visual speech recognition (AVSR) integrates audio and visual information to perform automatic speech recognition (ASR), which improves the robustness of human-robot interaction systems, especially in noisy environments. However, few methods and applications have paid attention to AVSR in tonal languages, where linguistic features can play as important a role as visual information. In this work, we propose a method for Mandarin AVSR based on adaptive decision fusion that makes full use of tone features. First, we introduce tone features computed by the Constant-Q Transform (CQT) and feed them, together with Mel-Frequency Cepstral Coefficient (MFCC) audio features, into a CNN-based audio network. Then, visual features are extracted by the Discrete Cosine Transform (DCT) from mouth regions in video frames and modeled by an LSTM-based visual network. Finally, an adaptive decision fusion network combines the outputs of both streams to make the final prediction. Experimental results on the PKU-AV2 dataset show that tone features significantly improve the robustness of Mandarin speech recognition, and that the proposed method adapts well to various noise environments.
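The decision-fusion step described above can be illustrated with a small sketch. Note that in the paper the fusion weights are produced by a trained network; the entropy-based reliability weighting below is a common hand-crafted proxy used here purely for illustration, and the class posteriors are made-up example values.

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a posterior distribution (natural log)
    return -np.sum(p * np.log(p + 1e-12))

def adaptive_fusion(p_audio, p_visual):
    # Assumption: stream reliability is approximated by 1 - normalized
    # entropy (a peaked posterior is treated as more trustworthy).
    # The paper instead learns these weights with a fusion network.
    max_h = np.log(len(p_audio))
    c_a = 1.0 - entropy(p_audio) / max_h
    c_v = 1.0 - entropy(p_visual) / max_h
    w_a = c_a / (c_a + c_v + 1e-12)
    fused = w_a * p_audio + (1.0 - w_a) * p_visual
    return fused / fused.sum()

# Hypothetical posteriors: the audio stream is confident, the visual
# stream is close to uniform, so the audio stream dominates the fusion.
p_a = np.array([0.8, 0.1, 0.1])
p_v = np.array([0.4, 0.35, 0.25])
print(adaptive_fusion(p_a, p_v))
```

Under heavy acoustic noise the audio posterior flattens, its weight drops, and the visual stream takes over, which is the behavior an adaptive fusion scheme is meant to capture.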