IMPROVING THE CLASSIFICATION OF PHONETIC SEGMENTS FROM RAW ULTRASOUND USING SELF-SUPERVISED LEARNING AND HARD EXAMPLE MINING
Yunsheng Xiong, Kele Xu, Yong Dou, Meng Jiang, Liang Cheng, Jinjia Wang
SPS
Ultrasound tongue imaging is an attractive modality for studying speech production, as it provides effective visualization of the vocal tract. Automatic classification of phonetic segments (tongue shapes) from raw ultrasound data is vital for further interpretation. Recently, deep learning-based approaches have been adopted for this task, but they require large-scale annotated datasets for training, which are not easy to obtain. Moreover, the data may contain many examples that are hard to classify, owing to contamination by speckle noise. In this paper, we aim to address these issues: first, self-supervised learning is adopted to exploit unlabeled data and extract features without any human annotation; second, hard example mining is applied to imitate the learning path of clinical linguists. To empirically demonstrate the effectiveness of the proposed method, we evaluate it on the Ultrax Typically Developing (UXTD) dataset under different scenarios. The results show that the proposed method achieves superior performance compared with competing methods.
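The hard-example-mining idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation; it shows the common online variant, in which only the highest-loss samples in each mini-batch contribute to the update. The function name, batch values, and keep ratio are all hypothetical.

```python
import numpy as np

def mine_hard_examples(losses, keep_ratio=0.5):
    """Return indices of the hardest (highest per-sample loss) examples.

    Online hard example mining: only the top `keep_ratio` fraction of a
    mini-batch, ranked by per-sample loss, is kept for the gradient step,
    focusing training on frames the model currently misclassifies.
    """
    k = max(1, int(len(losses) * keep_ratio))
    # argsort is ascending, so the last k indices hold the largest losses
    return np.argsort(losses)[-k:]

# Hypothetical per-sample cross-entropy losses for a batch of 6 frames
batch_losses = np.array([0.1, 2.3, 0.4, 1.8, 0.05, 0.9])
hard_idx = mine_hard_examples(batch_losses, keep_ratio=0.5)
# hard_idx now selects the three highest-loss frames (indices 1, 3, 5)
```

In a training loop, the loss would be averaged only over `hard_idx` before backpropagation, so easy, low-loss frames stop dominating the update.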