Poster 10 Oct 2023

The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misdirected optimization step, and we address this problem by accumulating the optimization direction locally instead of the raw gradients. We also empirically demonstrate that most sparse gradients from different workers do not overlap, and thus show that sparsification is comparable to an asynchronous update. Our experiments on classification and segmentation tasks show that our method maintains the correct optimization direction in distributed training even under highly sparse updates.
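To make the core idea concrete, the following is a minimal single-worker sketch of how one might accumulate the Adam update direction locally and exchange only its top-k entries, keeping the residual for later steps. The function names (`adam_direction`, `sparse_step`), the top-k selection rule, and the state layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compute the Adam update direction (before the learning rate) for one step."""
    m[:] = beta1 * m + (1 - beta1) * grad
    v[:] = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps)

def sparse_step(param, grad, state, k, lr=1e-3):
    """One hypothetical local step: accumulate the Adam *direction* locally,
    apply only its k largest-magnitude entries, and keep the rest as residual."""
    state["t"] += 1
    direction = adam_direction(grad, state["m"], state["v"], state["t"])

    # Accumulate the optimization direction (not the raw gradient) locally.
    state["acc"] += direction

    # Select the k largest-magnitude entries of the accumulated direction.
    idx = np.argpartition(np.abs(state["acc"]), -k)[-k:]
    update = np.zeros_like(param)
    update[idx] = state["acc"][idx]

    # In a distributed setting, the selected sparse entries would be exchanged
    # between workers here; the unselected residual stays in the local buffer.
    state["acc"][idx] = 0.0
    param -= lr * update
    return param

# Example usage on a toy parameter vector.
d = 10
state = {"m": np.zeros(d), "v": np.zeros(d), "acc": np.zeros(d), "t": 0}
param = np.ones(d)
grad = np.random.randn(d)
param = sparse_step(param, grad, state, k=3)
```

Accumulating the direction rather than the raw gradient matters because Adam's update is a nonlinear function of the gradient history; summing delayed raw gradients and feeding them to Adam later does not reproduce the steps the optimizer would have taken.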