Raw Ultrasound-based Phonetic Segments Classification Via Mask Modeling
Kang You (Shanghai Jiao Tong University); Bo Liu (National University of Defense Technology); Kele Xu (National Key Laboratory of Parallel and Distributed Processing (PDL)); Yunsheng Xiong (National University of Defense Technology); Qisheng Xu (National University of Defense Technology); Ming Feng (Tongji University); Tamás G Csapó (Budapest University of Technology and Economics); Boqing Zhu (National University of Defense Technology)
SPS
Ultrasound tongue imaging (UTI) is widely used in clinical linguistics and phonetics. Recently, deep neural networks, especially convolutional neural networks, have been widely applied to the interpretation and analysis of ultrasound tongue images. Despite achieving satisfactory performance, these methods rely on large amounts of manually labeled data, which are often difficult to obtain in practical settings. To address this issue, this paper focuses on how to use a large amount of unlabeled UTI data to improve performance on the UTI classification task. Specifically, we explore self-supervised learning with masking strategies. During pre-training, the network predicts the masked part of the input, which enables it to infer contextual information. We then fine-tune the pre-trained model with a small amount of labeled data. Compared with previous competing algorithms, our method improves classification accuracy by an average of 13.33% across four different scenarios.
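The masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the masking strategy, so the patch size, the mask ratio, the frame shape, and the function names `random_patch_mask` and `masked_mse` are all assumptions chosen for illustration. The sketch hides a random subset of non-overlapping patches in a single frame and scores a reconstruction only on the hidden pixels, which is the usual shape of a masked-modeling pre-training objective.

```python
import numpy as np


def random_patch_mask(frame, patch=8, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper: patch size and mask ratio are illustrative values.
    Returns the masked frame and a boolean pixel mask (True = hidden).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame dimensions must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))

    # Choose which patches to hide, without replacement.
    hidden = np.zeros(n_patches, dtype=bool)
    hidden[rng.choice(n_patches, size=n_masked, replace=False)] = True

    # Expand the patch-level grid to a pixel-level mask.
    grid = hidden.reshape(gh, gw)
    pixel_mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)

    masked = np.where(pixel_mask, 0.0, frame)
    return masked, pixel_mask


def masked_mse(prediction, target, pixel_mask):
    """Mean squared error computed only over the hidden pixels."""
    diff = (prediction - target)[pixel_mask]
    return float(np.mean(diff ** 2))


# Example with a synthetic 64 x 128 array standing in for one raw frame.
rng = np.random.default_rng(0)
frame = rng.random((64, 128))
masked, pixel_mask = random_patch_mask(frame, patch=8, mask_ratio=0.5, rng=rng)
loss_if_untrained = masked_mse(masked, frame, pixel_mask)
```

In a real pipeline, `masked` would be fed to an encoder-decoder network and `masked_mse` would be applied to the network output during pre-training on unlabeled frames; the encoder would then be fine-tuned with a classification head on the small labeled set.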