MMCOSINE: MULTI-MODAL COSINE LOSS TOWARDS BALANCED AUDIO-VISUAL FINE-GRAINED LEARNING

Ruize Xu (Renmin University of China); Ruoxuan Feng (Renmin University of China); Shixiong Zhang (Tencent); Di Hu (Renmin University of China)

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

06 Jun 2023

Audio-visual learning helps to comprehensively understand the world by fusing practical information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that the up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which have a higher demand for distinguishable feature distribution. Fueled by the success of cosine loss that builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes Multi-Modal Cosine loss, MMCosine. It performs a modality-wise L2 normalization to features and weights towards balanced and better multi-modal fine-grained learning. We demonstrate that our method can alleviate the imbalanced optimization from the perspective of weight norm and fully exploit the discriminability of the cosine metric. Extensive experiments prove the effectiveness of our method and the versatility with advanced multi-modal fusion strategies and up-to-date imbalance-mitigating methods.

Tags:

Machine learning for image processing

MMCOSINE: MULTI-MODAL COSINE LOSS TOWARDS BALANCED AUDIO-VISUAL FINE-GRAINED LEARNING

Ruize Xu (Renmin University of China); Ruoxuan Feng (Renmin University of China); Shixiong Zhang (Tencent); Di Hu (Renmin University of China)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

RETIFORMER: RETINEX-BASED ENHANCEMENT IN TRANSFORMER FOR LOW-LIGHT IMAGE

Learning Supervised Covariation Projection Through General Covariance

Learning Generalizable Light Field Networks from Few Images

Join the IEEE Signal Processing Society