WL-MSR: Watch and Listen for Multimodal Subtitle Recognition

Jiawei Liu (Institute of Automation, Chinese Academy of Sciences and School of Artificial Intelligence, University of Chinese Academy of Sciences); Hao Wang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences and School of Artificial Intelligence, University of Chinese Academy of Sciences); Weining Wang (The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences); Xingjian He (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences and School of Artificial Intelligence, University of Chinese Academy of Sciences); Jing Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

07 Jun 2023

Video subtitles can be defined as the combination of subtitles rendered in video frames and the textual content recognized from speech; they play a significant role in video understanding for both humans and machines. In this paper, we propose a novel Watch and Listen for Multimodal Subtitle Recognition (WL-MSR) framework that obtains comprehensive video subtitles by fusing the information provided by Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) models. Specifically, we build a Transformer model with mask and crop strategies and multi-level identity embeddings to aggregate both the textual results and the features of the two modalities. To filter out noisy items in the OCR results before fusion, we apply an OCR filter based on the ASR results and the OCR confidence scores. By combining these techniques, our solution won 2nd place in the Multimodal Subtitle Recognition Challenge at ICPR 2022.
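Since only the abstract is available here, the two sketches below are illustrative readings of the described pipeline, not the authors' implementation. First, a minimal sketch of how the fusion Transformer's multi-level identity embeddings might be wired up, assuming PyTorch; the class name, dimensions, and the two-level scheme (a modality id plus a per-item id for each OCR box or ASR segment) are hypothetical choices consistent with the description above:

```python
import torch
import torch.nn as nn


class MultimodalSubtitleEncoder(nn.Module):
    """Transformer encoder over concatenated OCR and ASR token sequences.

    Each token embedding is summed with a positional embedding, a modality
    embedding (OCR vs. ASR), and an item-level embedding (which OCR box or
    ASR segment the token came from), so the encoder can distinguish
    identical tokens that arrive from different sources.
    """

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4,
                 max_len=512, max_items=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.modality_emb = nn.Embedding(2, d_model)      # 0 = OCR, 1 = ASR
        self.item_emb = nn.Embedding(max_items, d_model)  # box / segment id
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens, modality_ids, item_ids, pad_mask=None):
        # tokens, modality_ids, item_ids: (batch, seq_len) integer tensors.
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = (self.token_emb(tokens)
             + self.pos_emb(pos)
             + self.modality_emb(modality_ids)
             + self.item_emb(item_ids))
        return self.encoder(x, src_key_padding_mask=pad_mask)
```

Second, a minimal sketch of an OCR pre-filter in the spirit described above: keep high-confidence OCR items, and rescue low-confidence ones only when they overlap substantially with the ASR transcript. The function name, thresholds, and the longest-common-substring overlap measure are assumptions, not the paper's exact criterion:

```python
from difflib import SequenceMatcher


def filter_ocr_items(ocr_items, asr_text, conf_thresh=0.6, overlap_thresh=0.5):
    """Keep OCR items that are high-confidence or supported by the ASR text.

    ocr_items: list of (text, confidence) pairs produced by an OCR model.
    asr_text:  transcript produced by an ASR model for the same clip.
    """
    kept = []
    for text, conf in ocr_items:
        if not text:
            continue
        # High-confidence detections pass through unconditionally.
        if conf >= conf_thresh:
            kept.append((text, conf))
            continue
        # Otherwise, keep the item only if a large fraction of it also
        # appears in the ASR transcript (longest common substring).
        match = SequenceMatcher(None, text, asr_text).find_longest_match(
            0, len(text), 0, len(asr_text))
        if match.size / len(text) >= overlap_thresh:
            kept.append((text, conf))
    return kept
```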
