27 Oct 2020

Audio-visual speech recognition (AVSR) integrates audio and visual information to perform automatic speech recognition (ASR), improving the robustness of human-robot interaction systems, especially in noisy environments. However, few methods and applications have paid attention to AVSR in tonal languages, where linguistic features such as tone can be as informative as visual information. In this work, we propose a method for AVSR in Mandarin based on adaptive decision fusion that makes full use of tone features. First, we compute tone features with the constant-Q transform (CQT) and feed them, together with Mel-frequency cepstral coefficient (MFCC) features, into a CNN-based audio network. Then, visual features are extracted by the discrete cosine transform (DCT) from mouth regions in video frames and modeled by an LSTM-based visual network. Finally, an adaptive decision fusion network combines the outputs of both streams to make the final predictions. Experimental results on the PKU-AV2 dataset show that the tone features significantly improve the robustness of Mandarin speech recognition and demonstrate the adaptability of the proposed method to various noisy environments.
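The sketch below illustrates the overall pipeline described in the abstract: CQT tone features stacked with MFCCs into a CNN audio stream, DCT features from mouth crops into an LSTM visual stream, and a small network that weights the two streams per sample. It is a minimal reconstruction, not the authors' code: the use of librosa for CQT/MFCC, SciPy for the 2-D DCT, PyTorch for the networks, and all layer sizes, the 8x8 DCT block, and the class count are illustrative assumptions.

```python
# Minimal sketch of a CQT+MFCC / DCT+LSTM AVSR pipeline with adaptive
# decision fusion. All hyperparameters here are assumptions, not the paper's.
import numpy as np
import librosa
import torch
import torch.nn as nn
from scipy.fftpack import dct

def audio_features(wav, sr=16000):
    """Stack CQT tone features with MFCCs, frame by frame."""
    cqt = np.abs(librosa.cqt(wav, sr=sr, hop_length=512, n_bins=84)).T
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, hop_length=512, n_mfcc=13).T
    n = min(len(cqt), len(mfcc))
    return np.hstack([cqt[:n], mfcc[:n]])          # (n_frames, 84 + 13)

def visual_features(mouth_frames, keep=8):
    """2-D DCT of each grayscale mouth crop; keep a low-frequency block."""
    feats = []
    for f in mouth_frames:                         # f: (H, W) crop
        d = dct(dct(f, axis=0, norm='ortho'), axis=1, norm='ortho')
        feats.append(d[:keep, :keep].reshape(-1))  # keep*keep coefficients
    return np.stack(feats)                         # (n_frames, keep*keep)

class AdaptiveFusionAVSR(nn.Module):
    """CNN audio stream + LSTM visual stream, fused by learned weights."""
    def __init__(self, n_audio=97, n_visual=64, n_classes=50):
        super().__init__()
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_audio, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, n_classes),
        )
        self.visual_lstm = nn.LSTM(n_visual, 128, batch_first=True)
        self.visual_head = nn.Linear(128, n_classes)
        # The fusion net predicts one weight per stream from both logits,
        # so the balance between audio and video adapts to each input.
        self.fusion = nn.Sequential(nn.Linear(2 * n_classes, 2),
                                    nn.Softmax(dim=-1))

    def forward(self, audio, visual):
        # audio: (B, T_a, n_audio); visual: (B, T_v, n_visual)
        a = self.audio_cnn(audio.transpose(1, 2))   # (B, n_classes)
        _, (h, _) = self.visual_lstm(visual)
        v = self.visual_head(h[-1])                 # (B, n_classes)
        w = self.fusion(torch.cat([a, v], dim=-1))  # (B, 2) stream weights
        return w[:, :1] * a + w[:, 1:] * v          # weighted decision fusion
```

The key design point is that fusion happens at the decision level rather than the feature level: each stream produces its own class scores, and the learned weights can discount the audio stream when it is degraded by noise, which is consistent with the adaptability claim in the abstract.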
