Improving BERT Fine-tuning via Stabilizing Cross-layer Mutual Information
Jicun Li (1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); 2. University of Chinese Academy of Sciences, Beijing, China); Xingjian Li (1. Big Data Lab, Baidu Research; 2. State Key Lab of IOTSC, University of Macau); Tianyang Wang (University of Alabama at Birmingham); Shi Wang (1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); 2. University of Chinese Academy of Sciences, Beijing, China); Yanan Cao (Institute of Information Engineering, Chinese Academy of Sciences); Cheng-Zhong Xu (University of Macau); Dejing Dou (Baidu)
Fine-tuning pre-trained language models such as BERT has shown enormous success across various NLP tasks. Though simple and effective, fine-tuning has been found to be unstable, often leading to unexpectedly poor performance. To improve stability and generalizability, most existing works resort to preserving the parameters or representations of the pre-trained model during fine-tuning. Nevertheless, little work has explored mining the reliable part of the pre-learned information that can help stabilize fine-tuning. To address this challenge, we introduce a novel solution in which we fine-tune BERT with stabilized cross-layer mutual information. Our method aims to preserve the pre-trained model's reliable cross-layer information propagation behavior, rather than the propagated information itself, and thereby circumvents domain conflicts between the pre-training and target tasks. We conduct extensive experiments with popular pre-trained BERT variants on NLP datasets, demonstrating the universal effectiveness and robustness of our method.
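To make the idea of "stabilizing cross-layer information propagation rather than the information itself" concrete, the sketch below shows one possible way to implement such a regularizer during fine-tuning. This is not the authors' released code: the mutual-information estimator is replaced here by a simple representation-similarity proxy (cosine similarity between mean-pooled adjacent-layer states), and the names `cross_layer_profile`, `stabilization_loss`, and `lambda_reg` are hypothetical. The paper's actual estimator and loss form may differ.

```python
# Minimal sketch, assuming cross-layer MI is approximated by a similarity
# proxy and stabilization is a penalty on drift from the frozen pre-trained
# model's cross-layer profile. Illustrative only, not the authors' method.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Frozen reference copy keeps the pre-trained propagation behavior.
reference = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
for p in reference.parameters():
    p.requires_grad_(False)

# The model being fine-tuned on the target task.
model = BertModel.from_pretrained("bert-base-uncased").to(device)


def cross_layer_profile(hidden_states):
    """Proxy for cross-layer MI: cosine similarity between mean-pooled
    representations of adjacent layers, one scalar per layer pair."""
    pooled = [h.mean(dim=1) for h in hidden_states]       # [batch, hidden] per layer
    sims = [F.cosine_similarity(a, b, dim=-1).mean()      # scalar per adjacent pair
            for a, b in zip(pooled[:-1], pooled[1:])]
    return torch.stack(sims)                              # [num_layer_pairs]


def stabilization_loss(texts):
    """Penalize drift of the fine-tuned model's cross-layer profile from the
    frozen pre-trained model's profile on the same batch."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    return_tensors="pt").to(device)
    with torch.no_grad():
        ref_profile = cross_layer_profile(
            reference(**enc, output_hidden_states=True).hidden_states)
    cur_profile = cross_layer_profile(
        model(**enc, output_hidden_states=True).hidden_states)
    return F.mse_loss(cur_profile, ref_profile)


# Usage inside a training step (task_loss comes from the downstream head):
#   loss = task_loss + lambda_reg * stabilization_loss(texts)
```

Because the penalty is applied to the shape of the cross-layer similarity profile rather than to the hidden representations themselves, the fine-tuned model remains free to adapt its features to the target domain while keeping the layer-to-layer propagation pattern close to the pre-trained one, which is the behavior the abstract describes.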