PH.D. DEFENCE - PUBLIC SEMINAR

Towards Efficient Transformer Scaling

Speaker
Mr. Xue Fuzhao
Advisor
Dr. You Yang, NUS Presidential Young Professor, School of Computing


Thursday, 12 Sep 2024, 02:00 PM to 03:30 PM

MR20, COM3-02-59

Abstract:

In recent years, Transformer-based deep learning models have exhibited remarkable performance across a wide range of tasks. A pivotal advantage of the Transformer architecture lies in its scalability along dimensions such as dataset size, parameter count, and computational budget. This scalability allows Transformers to achieve substantial improvements and even unlock novel capabilities, making possible tasks that were previously out of reach.

However, scaling comes at a considerable cost, and resource constraints in turn limit the progress of deep learning. This thesis addresses this challenge by exploring a series of strategies to make Transformer scaling more efficient.

First, adding trainable parameters can significantly improve performance, but at the cost of increased memory usage. To address this trade-off, we present WideNet, a parameter-efficient model that combines parameter sharing with Mixture-of-Experts layers and achieves superior results on both computer vision and natural language tasks.
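A minimal PyTorch-style sketch of this idea, assuming a generic design in which one attention block is shared across every layer while each layer keeps its own Mixture-of-Experts feed-forward network (an illustrative simplification, not the actual WideNet implementation; all module and parameter names are hypothetical):

    # Minimal sketch (assumptions, not the official WideNet code): a single attention
    # module is reused at every depth, while each layer keeps its own Mixture-of-Experts
    # feed-forward network, trading a few extra experts for far fewer unique parameters.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFFN(nn.Module):
        """Tiny top-1 routed Mixture-of-Experts feed-forward layer (dense compute for clarity)."""
        def __init__(self, d_model, d_ff, n_experts):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                                  # x: (batch, seq, d_model)
            gate = F.softmax(self.router(x), dim=-1)           # routing probabilities
            top1 = gate.argmax(dim=-1)                         # chosen expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = (top1 == i).unsqueeze(-1)               # tokens routed to expert i
                out = out + mask * expert(x) * gate[..., i:i + 1]
            return out

    class SharedAttentionMoENet(nn.Module):
        def __init__(self, d_model=256, n_heads=4, d_ff=1024, n_layers=12, n_experts=4):
            super().__init__()
            # One attention block shared across all layers (parameter sharing).
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Per-layer MoE FFNs and layer norms keep the shared layers from collapsing.
            self.ffns = nn.ModuleList(MoEFFN(d_model, d_ff, n_experts) for _ in range(n_layers))
            self.norms1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
            self.norms2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))

        def forward(self, x):                                  # x: (batch, seq, d_model)
            for ffn, norm1, norm2 in zip(self.ffns, self.norms1, self.norms2):
                h = norm1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
                x = x + ffn(norm2(x))
            return x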

Second, Transformer models trained with different objectives at the same scale often adopt a uniform configuration of width and depth. Our investigation into the relationship between Transformer configuration and training objective reveals that token-level training favours deeper and narrower configurations, whereas sequence-level training struggles to scale in depth due to over-smoothing.
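As a rough illustration of why configuration is a free choice at a fixed scale, the back-of-the-envelope count below (illustrative numbers, not the configurations studied in the thesis; it counts only attention and feed-forward weights) shows a deeper-narrower and a shallower-wider model with essentially the same parameter budget:

    # Rough parameter accounting for one standard block (ignoring embeddings, biases,
    # and layer norms): ~4*d^2 for the attention projections plus ~8*d^2 for a 4x FFN.
    def approx_params(depth, d_model):
        return depth * 12 * d_model ** 2

    deep_narrow  = approx_params(depth=24, d_model=1024)     # deeper and narrower
    shallow_wide = approx_params(depth=6,  d_model=2048)     # shallower and wider

    print(f"deep-narrow : ~{deep_narrow / 1e6:.0f}M parameters")   # ~302M
    print(f"shallow-wide: ~{shallow_wide / 1e6:.0f}M parameters")  # ~302M
    # Same budget, very different shape - which shape trains better turns out to
    # depend on whether the objective is token-level or sequence-level.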

Motivated by real-world applications that must process long input sequences (e.g., document understanding and medical image processing), we then focus on scaling the Transformer along the sequence-length dimension from a training-system perspective. Our sequence parallelism approach achieves a 27-fold increase in maximum sequence length over previous methods.
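A single-process sketch of the underlying idea, assuming the simplest chunked formulation (hypothetical helper names; the actual system places chunks on different devices and exchanges keys and values between workers rather than materialising them locally): the sequence is split along its length, each worker stores only its own chunk of activations, and attention for a local query chunk is assembled by visiting the other chunks.

    # Single-process sketch of the idea behind sequence parallelism (simplified).
    import torch

    def chunked_attention(q_chunks, k_chunks, v_chunks, d):
        """Each 'worker' holds one (chunk_len, d) slice of Q/K/V; the output for a
        local query chunk is assembled by visiting every key/value chunk."""
        outputs = []
        for q in q_chunks:                                   # computed by worker i
            scores = [q @ k.T / d ** 0.5 for k in k_chunks]  # received chunk by chunk
            weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
            outputs.append(weights @ torch.cat(v_chunks, dim=0))
        return outputs

    seq_len, d, n_workers = 1024, 64, 4
    q, k, v = (torch.randn(seq_len, d) for _ in range(3))
    q_chunks, k_chunks, v_chunks = (t.chunk(n_workers, dim=0) for t in (q, k, v))
    out = torch.cat(chunked_attention(q_chunks, k_chunks, v_chunks, d), dim=0)
    full = torch.softmax(q @ k.T / d ** 0.5, dim=-1) @ v     # vanilla attention
    assert torch.allclose(out, full, atol=1e-4)              # same result, split activations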

Furthermore, a Transformer at a given scale spends a fixed computation budget on every input, so serving diverse service levels typically requires deploying multiple models at different scales. To address this, we introduce AdaTape, which enables adaptive computation through elastic input sequences, offering a better cost-effectiveness trade-off and greater flexibility in using foundation models.
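A hedged sketch of what elastic input sequences could look like, assuming a simplified mechanism in which each example appends a variable number of learnable tape tokens (all names are hypothetical and the selection rule below is a stand-in for the published method):

    # Sketch of elastic input sequences in the spirit of AdaTape (hypothetical names;
    # the published method selects tape tokens differently). Each example is given a
    # different number of extra "tape" tokens, so some inputs receive a longer
    # sequence and therefore more computation.
    import torch
    import torch.nn as nn

    class ElasticTapeAppender(nn.Module):
        def __init__(self, d_model, tape_size=32, max_tape_tokens=8):
            super().__init__()
            self.tape = nn.Parameter(torch.randn(tape_size, d_model))  # learnable tape bank
            self.scorer = nn.Linear(d_model, 1)                        # crude per-example score
            self.max_tape_tokens = max_tape_tokens

        def forward(self, x):                                # x: (seq, d_model), one example
            # Decide how many tape tokens this example receives (elastic length).
            score = torch.sigmoid(self.scorer(x.mean(dim=0)))          # scalar in (0, 1)
            n_extra = int(torch.round(score * self.max_tape_tokens).item())
            if n_extra == 0:
                return x
            # Pick the tape tokens most aligned with the input summary and append them.
            sims = self.tape @ x.mean(dim=0)
            picked = self.tape[sims.topk(n_extra).indices]
            return torch.cat([x, picked], dim=0)             # longer sequence => more compute

    appender = ElasticTapeAppender(d_model=64)
    x1, x2 = torch.randn(16, 64), torch.randn(16, 64)
    print(appender(x1).shape, appender(x2).shape)            # sequence lengths can differ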

Lastly, recent insights from the transformer scaling community highlight the underestimated significance of dataset size. Rather than scaling trainable parameters faster than the dataset, achieving compute-optimal results requires a proportional scaling of model parameters and training tokens. Our exploration into dataset scaling reveals potential limitations in further scaling up large language models, prompting ongoing research into this emerging challenge.