Attention-Based Multi-Encoder Automatic Pronunciation Assessment

Binghuai Lin, Liyuan Wang

Length: 00:08:49

10 Jun 2021

Automatic pronunciation assessment plays an important role in Computer-Assisted Pronunciation Training (CAPT). Traditional methods for assessing pronunciation in read-aloud tasks rely on features derived from automatic speech recognition (ASR), so they are sensitive both to the accuracy of the ASR system and to the effectiveness of the features. Moreover, the representation capability of these features is limited by the mismatch between the optimization goals of the ASR task and the scoring task. In this paper, we propose an end-to-end (E2E) pronunciation scoring network based on an attention mechanism and a multi-encoder architecture consisting of audio and text encoders. The network, optimized with a multi-task learning (MTL) framework, provides sentence-level scores as well as detailed word-level scores. Because data for pronunciation scoring is scarce, we pre-train the network in two steps using ASR data and synthetic data, and then fine-tune it on the limited high-quality scoring data. Experimental results on a dataset recorded by Chinese English-as-a-second-language (ESL) learners and labeled by three experts demonstrate that the proposed model outperforms the baseline in terms of Pearson correlation coefficient (PCC).
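
As a rough illustration of two ideas mentioned in the abstract, the sketch below computes the Pearson correlation coefficient between predicted and expert scores, and combines a sentence-level and a word-level error into one multi-task objective. The abstract does not give the loss function or its weighting, so `multitask_loss`, its mean-squared-error terms, and the `word_weight` parameter are assumptions made for illustration only, not the authors' formulation.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predictions and targets."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(
    sent_pred: Sequence[float],
    sent_target: Sequence[float],
    word_pred: Sequence[float],
    word_target: Sequence[float],
    word_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Assumed MTL objective: weighted sum of sentence- and word-level MSE."""
    sentence_term = mse(sent_pred, sent_target)
    word_term = mse(word_pred, word_target)
    return (1.0 - word_weight) * sentence_term + word_weight * word_term
```

For example, `pearson_correlation(model_scores, expert_scores)` could be called on held-out sentence scores to produce the kind of PCC figure the abstract reports, where both argument names are placeholders for lists of numeric scores.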

Chairs:

Eric Fosler-Lussier

Tags:

signal processing society

IEEE ICASSP 2021

virtual conference

June 6-11 2021