LIPREADING MODEL BASED ON WHOLE-PART COLLABORATIVE LEARNING
Weidong Tian, Housen Zhang, Chen Peng, Zhong-Qiu Zhao
Lipreading is the task of recognizing speech content from the visual information of a speaker's lip movements. Recent work has focused mainly on how to adequately extract temporal information, while spatial information is used only in a simple way after extraction. In this paper, we focus on making full use of spatial information in lipreading. The whole lip carries global spatial information, while the parts of the lip contain fine-grained spatial information. We propose a lipreading model based on whole-part collaborative learning (WPCL), which makes full use of both the global and the fine-grained spatial information of the lip. WPCL contains two branches, which process the whole and the part features respectively and are trained jointly by collaborative learning. Further, to account for the differing importance of the part features when combining them, we propose an adaptive part feature fusion (APFF) module to fuse them. Finally, we verify our viewpoints and evaluate WPCL through several experiments. Experiments on the LRW and CAS-VSR-W1k datasets demonstrate that our approach achieves state-of-the-art performance.
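The abstract describes the architecture only at a high level. The minimal PyTorch sketch below illustrates one plausible reading of the two-branch whole-part design with adaptive part fusion; the encoders, the scoring MLP inside APFF, the feature dimensions, and the symmetric-KL collaborative loss are all assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APFF(nn.Module):
    """Adaptive part feature fusion: scores each part feature and combines
    them with softmax-normalized weights. The scoring MLP is an assumption;
    the abstract only states that parts are weighted by their importance."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, parts):                      # parts: (B, P, D)
        w = F.softmax(self.score(parts), dim=1)    # importance per part: (B, P, 1)
        return (w * parts).sum(dim=1)              # fused part feature: (B, D)

class WPCL(nn.Module):
    """Two-branch whole-part model. The linear encoders stand in for the
    visual backbones, which the abstract does not specify."""
    def __init__(self, dim=512, num_classes=500):
        super().__init__()
        self.whole_encoder = nn.Linear(dim, dim)   # placeholder backbone (whole lip)
        self.part_encoder = nn.Linear(dim, dim)    # placeholder backbone (lip parts)
        self.apff = APFF(dim)
        self.whole_head = nn.Linear(dim, num_classes)
        self.part_head = nn.Linear(dim, num_classes)

    def forward(self, whole, parts):               # whole: (B, D); parts: (B, P, D)
        w_logits = self.whole_head(self.whole_encoder(whole))
        p_logits = self.part_head(self.apff(self.part_encoder(parts)))
        return w_logits, p_logits

def collaborative_loss(w_logits, p_logits, target, alpha=1.0):
    # Joint training: each branch learns from the labels, and a symmetric KL
    # term keeps the two branches consistent. This is one common realization
    # of "collaborative learning"; the paper's exact objective may differ.
    ce = F.cross_entropy(w_logits, target) + F.cross_entropy(p_logits, target)
    kl = F.kl_div(F.log_softmax(w_logits, -1),
                  F.softmax(p_logits, -1).detach(), reduction="batchmean") \
       + F.kl_div(F.log_softmax(p_logits, -1),
                  F.softmax(w_logits, -1).detach(), reduction="batchmean")
    return ce + alpha * kl

if __name__ == "__main__":
    model = WPCL()
    whole = torch.randn(4, 512)                    # global lip feature per clip
    parts = torch.randn(4, 3, 512)                 # e.g., 3 lip-region features
    w_logits, p_logits = model(whole, parts)
    loss = collaborative_loss(w_logits, p_logits, torch.randint(0, 500, (4,)))
    print(loss.item())
```

The stop-gradient on each KL target lets each branch act as a teacher for the other without collapsing the two branches into identical predictors, which matches the spirit of joint whole-part training described in the abstract.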