A MULTI-MODAL TRANSFORMER APPROACH FOR FOOTBALL EVENT CLASSIFICATION
Yixiao Zhang, Baihua Li, Hui Fang, Qinggang Meng
Video understanding has been enhanced by the use of multi-modal networks. However, recent multi-modal video analysis models have limited applicability to sports videos because of the specialised nature of such footage. This paper proposes a novel attention-based multi-modal neural network for sports event classification, trained with a multi-stage fusion strategy. The proposed network integrates three modalities to improve sports video classification performance: an image sequence modality, an audio modality, and a newly proposed sports formation modality. Empirical results show that the proposed model outperforms the state-of-the-art transformer-based video method by 4.43% in top-1 accuracy on the SoccerNet-v2 dataset.
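To make the idea of attention-based fusion across modalities concrete, the following is a minimal sketch and not the authors' architecture. It assumes each modality has already been encoded into a fixed-length embedding and that a single query vector scores them with scaled dot-product attention. The function name `attention_fuse`, the 4-dimensional toy embeddings, and the query values are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, modality_embeddings):
    """Fuse per-modality embeddings with scaled dot-product attention.

    Hypothetical helper: `query` and every embedding must share one dimension.
    Returns the fused vector and the attention weight given to each modality.
    """
    dim = len(query)
    names = list(modality_embeddings)
    # One attention score per modality: (query . embedding) / sqrt(dim).
    scores = [
        sum(q * k for q, k in zip(query, modality_embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # The fused representation is the attention-weighted sum of the embeddings.
    fused = [
        sum(w * modality_embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings (assumed values) for the three modalities named in the abstract.
embeddings = {
    "image_sequence": [0.9, 0.1, 0.0, 0.2],
    "audio": [0.1, 0.8, 0.1, 0.0],
    "formation": [0.2, 0.1, 0.9, 0.3],
}
query = [1.0, 0.0, 1.0, 0.0]  # assumed query vector; learned in a real model

fused, weights = attention_fuse(query, embeddings)
print(weights)
```

In a trained model the query, keys, and values would come from learned projections, and the fused vector would feed a classification head. The sketch only shows how attention weights let the network emphasise whichever modality is most informative for a given clip.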