MULTI-MODAL LEARNING WITH TEXT MERGING FOR TEXTVQA
Changsheng Xu, Zhenlong Xu, Yifan He, Shuigeng Zhou, Jihong Guan
Text visual question answering (TextVQA) is an important task in visual text understanding: it requires understanding the text produced by a text recognition module and providing correct answers to specific questions. Recent TextVQA works have tried to combine text recognition with multi-modal learning. However, because the output of text recognition is not effectively preprocessed, existing approaches suffer from a serious loss of contextual information, which leads to unsatisfactory performance. In this work, we propose a Multi-Modal Learning framework with Text Merging (MML&TM for short) for TextVQA. We develop a text merging (TM) algorithm that merges the word-level text obtained from the text recognition module into line-level and paragraph-level texts, enriching the semantic context that is crucial for visual text understanding. The TM module can be easily incorporated into the multi-modal learning framework to generate more comprehensive answers for TextVQA. We evaluate our method on the public ST-VQA dataset. Experimental results show that the TM algorithm recovers more complete semantic information, which in turn helps MML&TM generate better answers for TextVQA.
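To illustrate the idea of merging word-level recognition output into line-level text, the following is a minimal sketch of a generic word-box grouping heuristic based on vertical overlap. It is not the paper's TM algorithm; the box format, the overlap threshold, and the grouping rule are assumptions made purely for illustration.

```python
# Hypothetical sketch: merge word-level OCR output into line-level text.
# Each word is a box dict with keys 'text', 'x', 'y', 'w', 'h'. Words whose
# vertical extents overlap sufficiently are grouped into one line and then
# sorted left-to-right. Thresholds and rules here are illustrative assumptions,
# not the TM algorithm described in the paper.

def merge_words_to_lines(words, overlap_ratio=0.5):
    """words: list of dicts with keys 'text', 'x', 'y', 'w', 'h'."""
    lines = []  # each line is a list of word dicts
    for word in sorted(words, key=lambda wd: (wd["y"], wd["x"])):
        placed = False
        for line in lines:
            ref = line[-1]
            # vertical overlap between this word and the line's last word
            top = max(word["y"], ref["y"])
            bottom = min(word["y"] + word["h"], ref["y"] + ref["h"])
            overlap = max(0, bottom - top)
            if overlap >= overlap_ratio * min(word["h"], ref["h"]):
                line.append(word)
                placed = True
                break
        if not placed:
            lines.append([word])
    # concatenate each line's words left-to-right into line-level text
    return [" ".join(w["text"] for w in sorted(line, key=lambda wd: wd["x"]))
            for line in lines]

if __name__ == "__main__":
    ocr_words = [
        {"text": "OPEN",  "x": 10, "y": 5,  "w": 40, "h": 12},
        {"text": "24",    "x": 55, "y": 6,  "w": 15, "h": 12},
        {"text": "HOURS", "x": 75, "y": 5,  "w": 45, "h": 12},
        {"text": "FREE",  "x": 12, "y": 25, "w": 35, "h": 12},
        {"text": "WIFI",  "x": 52, "y": 26, "w": 30, "h": 12},
    ]
    print(merge_words_to_lines(ocr_words))  # ['OPEN 24 HOURS', 'FREE WIFI']
```

Line-level strings produced this way could then be concatenated into paragraph-level text by a similar proximity rule, giving downstream multi-modal models richer context than isolated words.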