SPEAKER-INDEPENDENT LIPREADING WITH LIMITED DATA
Chenzhao Yang, Shilin Wang, Xingxuan Zhang, Yun Zhu
Recent research has demonstrated that, given a large annotated training dataset, some sophisticated automatic lipreading methods can even outperform a professional human lip reader. However, when the training set is limited, i.e., it contains only a few speakers, most existing lipreading approaches cannot provide accurate recognition results for unseen speakers due to inter-speaker variability. To improve lipreading performance in the speaker-independent scenario, a new deep neural network (DNN) is proposed in this paper. The proposed network is composed of two parts: the Transformer-based Visual Speech Recognition Network (TVSR-Net) and the Speaker Confusion Block (SC-Block). The TVSR-Net is designed to extract lip features and recognize the speech, while the SC-Block aims to achieve speaker normalization by eliminating the influence of individual talking styles and habits. A Multi-Task Learning (MTL) scheme is designed for network optimization. Experimental results on the GRID dataset demonstrate the effectiveness of the proposed network for speaker-independent recognition with limited training data.
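The abstract does not specify the internals of the SC-Block or the exact MTL objective, but a common way to realize this kind of speaker normalization is adversarial multi-task training with a gradient-reversal layer: a speech-recognition head and a speaker-classification head share one encoder, and the reversed speaker gradients push the encoder toward speaker-invariant lip features. The PyTorch sketch below illustrates that idea under those assumptions; all module names (the convolutional frontend, the Transformer encoder standing in for TVSR-Net, the speaker_head standing in for the SC-Block) and all hyperparameters are hypothetical, not the authors' implementation.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class LipreadingMTL(nn.Module):
    # feat_dim, num_classes, num_speakers are placeholder values, not from the paper.
    def __init__(self, feat_dim=256, num_classes=50, num_speakers=4, lam=1.0):
        super().__init__()
        self.lam = lam
        # Hypothetical front-end: a 3D convolution over the grayscale lip-ROI video.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space to 4x4
        )
        self.proj = nn.Linear(32 * 4 * 4, feat_dim)
        # Stand-in for the TVSR-Net back-end: a small Transformer encoder.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.recognizer = nn.Linear(feat_dim, num_classes)      # speech-recognition head
        self.speaker_head = nn.Linear(feat_dim, num_speakers)   # SC-Block stand-in

    def forward(self, video):
        # video: (batch, 1, time, height, width) grayscale lip crops
        f = self.frontend(video)                      # (B, 32, T, 4, 4)
        B, C, T, H, W = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(B, T, C * H * W)
        h = self.encoder(self.proj(f))                # (B, T, feat_dim)
        speech_logits = self.recognizer(h)            # per-frame speech logits
        # Reverse gradients so optimizing the speaker loss makes the shared
        # encoder *worse* at identifying speakers, i.e., more speaker-invariant.
        pooled = GradientReversal.apply(h.mean(dim=1), self.lam)
        speaker_logits = self.speaker_head(pooled)
        return speech_logits, speaker_logits

Under this sketch, the multi-task objective would combine the two heads, e.g. loss = ce(speech_logits.mean(dim=1), word_labels) + alpha * ce(speaker_logits, speaker_ids), where alpha weights the speaker-confusion term; the actual loss design in the paper may differ.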