Poster 10 Oct 2023

The high bandwidth required for gradient exchange is a bottleneck for the distributed training of large transformer models. Most sparsification approaches focus on gradient compression for convolutional neural networks (CNNs) optimized by SGD. In this work, we show that performing local gradient accumulation when using Adam to optimize transformers in a distributed fashion leads to a misdirected optimization step, and we address this problem by accumulating the optimization direction locally instead of the raw gradients. We also empirically demonstrate that most sparse gradients from different workers do not overlap, and thus show that sparsification is comparable to an asynchronous update. Our experiments on classification and segmentation tasks show that our method maintains the correct optimization direction in distributed training even under highly sparse updates.
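To make the core idea concrete, the following is a minimal single-worker sketch of how one might accumulate the Adam update direction locally and exchange only its top-k entries, keeping the residual for later steps. The function names (`adam_direction`, `sparse_step`), the top-k selection rule, and the state layout are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adam_direction(grad, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Compute the Adam update direction (before the learning rate) for one step."""
    m[:] = beta1 * m + (1 - beta1) * grad
    v[:] = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps)

def sparse_step(param, grad, state, k, lr=1e-3):
    """One hypothetical local step: accumulate the Adam *direction* locally,
    apply only its k largest-magnitude entries, and keep the rest as residual."""
    state["t"] += 1
    direction = adam_direction(grad, state["m"], state["v"], state["t"])

    # Accumulate the optimization direction (not the raw gradient) locally.
    state["acc"] += direction

    # Select the k largest-magnitude entries of the accumulated direction.
    idx = np.argpartition(np.abs(state["acc"]), -k)[-k:]
    update = np.zeros_like(param)
    update[idx] = state["acc"][idx]

    # In a distributed setting, the selected sparse entries would be exchanged
    # between workers here; the unselected residual stays in the local buffer.
    state["acc"][idx] = 0.0
    param -= lr * update
    return param

# Example usage on a toy parameter vector.
d = 10
state = {"m": np.zeros(d), "v": np.zeros(d), "acc": np.zeros(d), "t": 0}
param = np.ones(d)
grad = np.random.randn(d)
param = sparse_step(param, grad, state, k=3)
```

Accumulating the direction rather than the raw gradient matters because Adam's update is a nonlinear function of the gradient history; summing delayed raw gradients and feeding them to Adam later does not reproduce the steps the optimizer would have taken.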