LEARNING ACOUSTIC FRAME LABELING FOR PHONEME SEGMENTATION WITH REGULARIZED ATTENTION MECHANISM
Binghuai Lin, Liyuan Wang
Length: 00:05:48
Phoneme segmentation plays an important role in speech processing applications such as keyword spotting, automatic pronunciation assessment, and automatic speech recognition. In this paper, we propose a phoneme segmentation method based on a regularized attention mechanism. Specifically, frame-level representations of the speech utterance are extracted from a pre-trained acoustic encoder and combined with the presumed phoneme sequence through an attention mechanism. By fusing the acoustic representations with the resulting aligned phoneme representations, we learn a phoneme label for each frame and derive the final segmentation from these labels. To improve the alignment between the pronounced phoneme sequence and the utterance, we regularize the attention matrix with an additional attention loss. The whole network is optimized within a multi-task learning (MTL) framework. Experimental results on the TIMIT and Buckeye corpora show that the proposed method outperforms previous baselines and achieves state-of-the-art (SOTA) performance in F1 score and R-value.
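The pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the scaled dot-product attention, the concatenation used for fusion, the negative-log-likelihood form of the attention loss, the loss weight, and the 10 ms frame shift are all assumptions made for illustration, and every function name is hypothetical. The frame-level classifier is omitted; the toy example uses the attention argmax as a stand-in for its output.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(frames, phonemes):
    """Attention weights: one row per frame, one column per phoneme.
    Scaled dot-product scoring is an assumption, not taken from the paper."""
    scale = math.sqrt(len(phonemes[0]))
    return [
        softmax([sum(f * p for f, p in zip(frame, ph)) / scale for ph in phonemes])
        for frame in frames
    ]

def fuse(frames, phonemes, attn):
    """Concatenate each frame vector with its attention-weighted phoneme vector
    (concatenation is an assumed fusion choice)."""
    dim = len(phonemes[0])
    aligned = [
        [sum(w * ph[d] for w, ph in zip(row, phonemes)) for d in range(dim)]
        for row in attn
    ]
    return [frame + vec for frame, vec in zip(frames, aligned)]

def attention_loss(attn, ref_alignment, eps=1e-9):
    """Assumed regularizer: mean negative log attention weight placed on the
    reference phoneme index of each frame."""
    return -sum(math.log(row[k] + eps) for row, k in zip(attn, ref_alignment)) / len(attn)

def multitask_loss(label_loss, attn_loss, weight=0.5):
    """Weighted sum of the frame-labeling loss and the attention loss
    (the weight value is hypothetical)."""
    return label_loss + weight * attn_loss

def boundaries_from_labels(frame_labels, frame_shift=0.01):
    """Boundary times in seconds wherever the frame-level label changes."""
    return [
        i * frame_shift
        for i in range(1, len(frame_labels))
        if frame_labels[i] != frame_labels[i - 1]
    ]

# Toy example: three 2-dimensional frames and two phoneme embeddings.
frames = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
phonemes = [[1.0, 0.0], [0.0, 1.0]]
attn = attention_matrix(frames, phonemes)
fused = fuse(frames, phonemes, attn)
# Stand-in for the frame classifier: pick the most attended phoneme per frame.
frame_labels = [row.index(max(row)) for row in attn]
print(frame_labels, boundaries_from_labels(frame_labels))
```

In this sketch a boundary is emitted at the start time of the first frame carrying a new label; a real system would also need a trained classifier over the fused features and a reference alignment to supervise the attention matrix.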