Analysis Of Multimodal Features For Speaking Proficiency Scoring In An Interview Dialogue
Mao Saeki, Yoichi Matsuyama, Satoshi Kobashikawa, Tetsuji Ogawa, Tetsunori Kobayashi
SPS
Length: 0:12:04
This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task of non-native speakers. Conversational competence of a language learner can be assessed through the use of multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the usage of visual features, such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multimodal features, we first constructed an online video interview dataset of 210 Japanese learners of English with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show that the end-to-end approach with deep neural networks achieves a higher correlation with human scoring than the approach with handcrafted features. The modalities are effective in the order of lexical, acoustic, and visual features.
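The comparison described above rests on how well system scores correlate with human scores. The following is a minimal sketch of that kind of comparison, assuming Pearson correlation as the measure (the abstract does not say which coefficient is used). The function name `pearson`, the labels `model_a` and `model_b`, and all score values are hypothetical and made up for illustration; none of them come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT data from the paper.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
system_predictions = {
    "model_a": [1.2, 1.9, 3.1, 3.8, 5.2],  # hypothetical scorer
    "model_b": [3.0, 1.0, 4.0, 2.0, 5.0],  # hypothetical scorer
}

# Rank the hypothetical scorers by agreement with the human scores.
ranking = sorted(
    system_predictions,
    key=lambda name: pearson(human, system_predictions[name]),
    reverse=True,
)
print(ranking)
```

The same ranking step could be applied to scorers trained on a single modality each, which is one way to read an ordering such as lexical, acoustic, then visual.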