Parallelizing Adam Optimizer With Blockwise Model-Update Filtering
Kai Chen, Haisong Ding, Qiang Huo
Recently, Adam has become a popular stochastic optimization method in the deep learning area. To parallelize Adam in a distributed system, the synchronous stochastic gradient (SSG) technique is widely used, but it is inefficient due to its heavy communication cost. In this paper, we instead parallelize Adam with blockwise model-update filtering (BMUF). BMUF synchronizes model updates periodically and introduces a block momentum to improve performance. We propose a novel way to modify Adam's estimated moment buffers and present a simple yet effective trick for setting hyper-parameters under the BMUF framework. Experimental results on a large-scale English optical character recognition (OCR) task and a large vocabulary continuous speech recognition (LVCSR) task show that BMUF-Adam achieves an almost linear speedup without recognition accuracy degradation, and outperforms the SSG-based method in terms of speedup, scalability and recognition accuracy.
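To illustrate the BMUF synchronization scheme the abstract describes, below is a minimal single-process sketch that simulates M parallel workers serially in NumPy: each worker runs a block of local Adam steps, the local models are averaged, and the aggregated model-update is filtered with a block momentum before updating the global model. The names `local_adam_step`, `bmuf_train`, and the `grad_fn` callback are illustrative assumptions, not the authors' implementation, and the paper's specific modification of Adam's moment buffers is not reproduced here.

```python
# A hedged sketch of BMUF-style periodic synchronization around local Adam updates.
# Workers are simulated serially; a real system would run them in parallel and
# exchange models over the network at each synchronization point.
import numpy as np

def local_adam_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update on a worker's local model copy."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def bmuf_train(grad_fn, w_global, n_workers=4, n_blocks=10, steps_per_block=50,
               block_momentum=0.9, block_lr=1.0):
    """BMUF outer loop: local Adam blocks, periodic averaging, blockwise filtering."""
    delta = np.zeros_like(w_global)                 # filtered model-update buffer
    for _ in range(n_blocks):
        local_ws = []
        for k in range(n_workers):                  # simulate parallel workers serially
            w = w_global.copy()
            m, v = np.zeros_like(w), np.zeros_like(w)
            for t in range(1, steps_per_block + 1):
                w, m, v = local_adam_step(w, m, v, grad_fn(w, k, t), t)
            local_ws.append(w)
        w_avg = np.mean(local_ws, axis=0)           # synchronize: average local models
        g_block = w_avg - w_global                  # aggregated model-update for this block
        delta = block_momentum * delta + block_lr * g_block   # block-momentum filtering
        w_global = w_global + delta                 # classical block momentum update
    return w_global
```

As a quick sanity check under these assumptions, calling `bmuf_train(lambda w, k, t: 2.0 * (w - 3.0), np.zeros(5))` on the gradient of a simple quadratic should move the parameters toward the minimum at w = 3, with communication occurring only once per block rather than at every mini-batch as in SSG.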