09 May 2022

Phoneme mispronunciation detection plays an important role in Computer-Assisted Pronunciation Training. Traditional methods either rely on phone recognition, which cannot detect mispronounced phonemes outside the dictionary, or require external phoneme alignment to extract acoustic features. In this paper, we propose a method for phoneme mispronunciation detection that jointly learns to align. Specifically, we first obtain acoustic and canonical phoneme representations using acoustic and phoneme encoders. Second, we use an attention mechanism to fuse the frame-level acoustic features with the phoneme representations. Finally, a convolutional neural network (CNN)-based layer applied to the fused representations better exploits local context. The network is jointly optimized for phoneme mispronunciation detection and phoneme alignment in a multi-task learning framework. Experimental results on the public L2-ARCTIC dataset show state-of-the-art (SOTA) performance with an F1-score of 63.04%. We also find that optimizing the phoneme alignment further improves phoneme mispronunciation detection.
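
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the architecture outlined in the abstract: an acoustic encoder, a canonical phoneme encoder, attention-based fusion, a CNN layer over the fused representations, and two heads trained jointly for mispronunciation detection and phoneme alignment. All module choices, dimensions, and the loss weighting are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of the jointly-optimized model; layer types, sizes,
# and the loss weight alpha are illustrative assumptions, not the paper's
# exact setup.
import torch
import torch.nn as nn


class JointMDDAligner(nn.Module):
    def __init__(self, n_phonemes, n_mels=80, d_model=256):
        super().__init__()
        # Acoustic encoder: frame-level representations from mel features.
        self.acoustic_encoder = nn.LSTM(n_mels, d_model // 2, num_layers=2,
                                        batch_first=True, bidirectional=True)
        # Phoneme encoder: canonical phoneme sequence representations.
        self.phoneme_embed = nn.Embedding(n_phonemes, d_model)
        self.phoneme_encoder = nn.LSTM(d_model, d_model // 2, num_layers=1,
                                       batch_first=True, bidirectional=True)
        # Attention: each canonical phoneme attends over acoustic frames.
        self.attention = nn.MultiheadAttention(d_model, num_heads=4,
                                               batch_first=True)
        # CNN over fused phoneme-level representations for local context.
        self.cnn = nn.Sequential(
            nn.Conv1d(2 * d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Task heads: per-phoneme mispronunciation flag, per-frame alignment.
        self.mdd_head = nn.Linear(d_model, 1)
        self.align_head = nn.Linear(d_model, n_phonemes)

    def forward(self, mels, canonical_phonemes):
        # mels: (B, T_frames, n_mels); canonical_phonemes: (B, T_phones)
        acoustic, _ = self.acoustic_encoder(mels)                 # (B, T_frames, d)
        phones, _ = self.phoneme_encoder(
            self.phoneme_embed(canonical_phonemes))               # (B, T_phones, d)
        # Fuse: phoneme queries attend to acoustic frames, then concatenate.
        attended, _ = self.attention(phones, acoustic, acoustic)  # (B, T_phones, d)
        fused = torch.cat([phones, attended], dim=-1)             # (B, T_phones, 2d)
        local = self.cnn(fused.transpose(1, 2)).transpose(1, 2)   # (B, T_phones, d)
        mdd_logits = self.mdd_head(local).squeeze(-1)             # (B, T_phones)
        align_logits = self.align_head(acoustic)                  # (B, T_frames, n_phonemes)
        return mdd_logits, align_logits


def joint_loss(mdd_logits, mdd_labels, align_logits, align_labels, alpha=0.5):
    """Multi-task objective: mispronunciation detection + frame-level alignment."""
    mdd_loss = nn.functional.binary_cross_entropy_with_logits(mdd_logits, mdd_labels)
    align_loss = nn.functional.cross_entropy(align_logits.transpose(1, 2), align_labels)
    return mdd_loss + alpha * align_loss
```

In this reading, the alignment head supervises frame-level phoneme labels while the detection head scores each canonical phoneme as correct or mispronounced, so both tasks share the same encoders and attention fusion, which is one plausible way to realize the multi-task framework the abstract describes.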