CS SEMINAR

Using low precision floating-point for Distributed Deep Learning

Speaker
Mr Han Ruobing, PhD student, Georgia Institute of Technology, USA
Chaired by
Dr YOU Yang, NUS Presidential Young Professor, School of Computing
youy@comp.nus.edu.sg

22 Mar 2021 Monday, 10:00 AM to 11:30 AM

via Zoom

Abstract:
In recent years, distributed deep learning has become popular in both industry and academia. Although researchers want to use distributed systems for training, the communication cost of synchronizing gradients has been reported to be a bottleneck that limits the scalability of distributed training. Using low-precision gradients is a promising technique for reducing the bandwidth requirement, but lower precision can cost accuracy. To address this dilemma, we propose Auto Precision Scaling (APS), an algorithm that improves accuracy when gradients are communicated as low-precision floating-point values. APS improves accuracy for all precisions at a trivial communication cost. Our experimental results show that, for both image classification and segmentation, APS can train state-of-the-art models with 8-bit floating-point gradients with no or only a tiny accuracy loss. Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS achieves a significant speedup over the state-of-the-art method.


Biodata:
Ruobing Han is a PhD student at Georgia Institute of Technology, USA. He received his Bachelor's degree in Computer Science from the EECS college at Peking University. His research interests include Parallel/Distributed Algorithms, Deep Learning, High-Performance Computing and Architecture. His current research focuses on scaling up deep neural network training on distributed systems. In 2018, his team broke the world record for ImageNet training speed by training AlexNet within 1.5 minutes. The corresponding paper has been accepted to IEEE Transactions on Big Data and was reported by Synced, one of the best-known technology media outlets. Ruobing is also interested in the development of open-source Deep Learning projects. He has contributed to a series of Deep Learning projects such as MMDetection, MMSegmentation and MMCV. All these projects are available on GitHub and have more than 15k stars in total.