Reviving Iterative Training With Mask Guidance For Interactive Segmentation
Konstantin Sofiiuk, Ilya Petrov, Anton Konushin
Automatic lipreading has attracted much research interest over the past few decades. Unlike English, Chinese is a tone-based language with a large character set, so the correlation between Chinese characters and lip motions is more complex. Most existing methods employ an intermediate representation (usually Pinyin) and adopt a cascaded architecture for Chinese lipreading. However, such a cascaded structure may accumulate errors, and using Pinyin as the intermediate representation causes a loss of visual information. Moreover, these approaches do not perform well for unseen speakers due to inter-speaker variability. In this paper, we propose a cascaded Transformer-based model with a new cross-level attention mechanism, which enriches the flow of information between cascaded stages and reduces error accumulation. Multiple intermediate representations, including Chinese Pinyin and visemes, are adopted to acquire multi-perspective visual and linguistic features and to improve generalization to unseen speakers. Evaluations on the public sentence-level Chinese lipreading database, CMLR, demonstrate the advantages of the proposed method over state-of-the-art approaches in both speaker-independent and multi-speaker scenarios.
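The abstract does not spell out how the cross-level attention mechanism is wired, so the following is only a minimal, illustrative sketch of one way a later cascading stage (e.g. a character-level decoder) could attend to the hidden states of an earlier stage (e.g. a Pinyin-level decoder) in addition to the visual features. All module names, dimensions, and the PyTorch-style layout are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a decoder layer with an extra "cross-level" attention
# block that consumes the hidden states of the previous cascading stage instead of
# only its discrete output symbols. Names and sizes are assumptions.
import torch
import torch.nn as nn


class CrossLevelDecoderLayer(nn.Module):
    """One decoder layer with self-attention, attention over visual (lip-motion)
    features, and an additional cross-level attention over the hidden states of
    an earlier decoding stage (e.g. Pinyin-level)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_level_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, tgt, visual_mem, prev_level_mem):
        # Self-attention over the current-level token embeddings.
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        # Cross-attention to visual (lip-motion) features from the encoder.
        x = self.norms[1](x + self.visual_attn(x, visual_mem, visual_mem)[0])
        # Cross-level attention: read the earlier stage's hidden states directly,
        # rather than relying only on its (possibly erroneous) decoded output.
        x = self.norms[2](x + self.cross_level_attn(x, prev_level_mem, prev_level_mem)[0])
        return self.norms[3](x + self.ffn(x))


# Toy usage with random tensors: batch of 2, 50 visual frames, 20 Pinyin-level
# hidden states, 15 character-level query tokens, model width 256.
layer = CrossLevelDecoderLayer()
out = layer(torch.randn(2, 15, 256), torch.randn(2, 50, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 15, 256])
```

The point of the sketch is the design choice suggested by the abstract: letting a later stage attend to richer information from the earlier stage (here, its hidden states) rather than consuming only its hard intermediate outputs, which is one plausible way a cascade can reduce the accumulation of upstream errors.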