Boosting BERT Subnets with Neural Grafting
Ting Hu (Hasso Plattner Institute); Christoph Meinel (Hasso Plattner Institute); Haojin Yang (Hasso-Plattner-Institut für Digital Engineering gGmbH)
Pre-trained language models in natural language processing have become increasingly computationally expensive and memory-demanding. The recently proposed computation-adaptive BERT models facilitate their deployment in practical applications. Training such a BERT model involves jointly optimizing subnets of varying sizes, which is difficult because the subnets interfere with one another. The larger subnets in particular can deteriorate when there is a large performance gap between the smallest subnet and the supernet. In this work, we propose Neural Grafting to boost BERT subnets, especially the larger ones. Specifically, we regard the less important sub-modules of a BERT model as less active and reactivate them via layer-wise Neural Grafting. Experimental results show that the proposed method improves the average performance of BERT subnets on six datasets of the GLUE benchmark. The subnet that performs comparably to the supernet BERT-Base reduces inference latency by around 67% on GPU and 70% on CPU. Moreover, we compare two Neural Grafting strategies under varied experimental settings, hoping to shed light on the application scenarios of Neural Grafting.
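The abstract does not spell out the grafting procedure, so the following is only a minimal illustrative sketch, assuming that layer-wise Neural Grafting amounts to blending weights from a donor model into the least active layer of the host model. The helper names (`activity_score`, `graft_layer`) and the blending coefficient `alpha` are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of layer-wise grafting on a toy BERT-like stack.
import copy
import torch
import torch.nn as nn

def activity_score(layer: nn.Module) -> float:
    # Proxy for how "active" a layer is: mean absolute weight magnitude.
    # (An assumption; the paper's importance measure may differ.)
    with torch.no_grad():
        return torch.cat([p.abs().flatten() for p in layer.parameters()]).mean().item()

def graft_layer(host_layer: nn.Module, donor_layer: nn.Module, alpha: float = 0.5) -> None:
    # Blend donor weights into the host layer in place:
    # w_host <- (1 - alpha) * w_host + alpha * w_donor
    with torch.no_grad():
        for p_host, p_donor in zip(host_layer.parameters(), donor_layer.parameters()):
            p_host.mul_(1.0 - alpha).add_(alpha * p_donor)

# Small Transformer encoder stack standing in for BERT-Base.
host = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(4)
])
donor = copy.deepcopy(host)  # e.g. a separately trained model of identical shape

# Reactivate the least active host layer by grafting the donor's weights into it.
least_active = min(range(len(host)), key=lambda i: activity_score(host[i]))
graft_layer(host[least_active], donor[least_active], alpha=0.5)
print(f"grafted layer {least_active}")
```

In this sketch the grafted layer would then be fine-tuned jointly with the rest of the supernet; whether grafting is applied once or repeatedly per layer is one of the strategy choices the abstract alludes to but does not detail.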