TWEO: FP8 Training And Quantization For Dummies
COM1 Level 3
MR1, COM1-03-19

Abstract:
Native FP8 support is essential for training large Transformers, but it is severely hindered by extreme activation outliers. Existing solutions rely on either complex mixed-precision engineering or invasive architectural modifications. We fundamentally challenge the conventional wisdom that outliers are data-driven and demonstrate that extreme outliers are a data-independent, mechanically produced artifact of training. In this talk, I will introduce TWEO, a novel, non-invasive solution. TWEO effectively prevents extreme outliers, reducing them from 10000+ to below 20. It is very simple, enables full-model FP8 pre-training for both LLMs and ViTs, and achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. TWEO also enables a new quantization paradigm: hardware-friendly W8A8 per-tensor static quantization of LLMs.
Bio:
Jianxin Wu received his BS and MS degrees from Nanjing University and his PhD degree from the Georgia Institute of Technology, all in computer science. He is a professor in the School of Artificial Intelligence at Nanjing University and the National Key Laboratory for Novel Software Technology, China. He has served as a program chair for CVPR'24, as a (senior) area chair for NeurIPS, CVPR, ICCV, ECCV, AAAI and IJCAI, and as an associate editor for IEEE T-PAMI. His research interests are computer vision and machine learning.

