Fixed And Floating Point Precision Optimized Approximation On Embedded And Parallel Architectures
COM2 Level 4
Executive Classroom, COM2-04-02
Abstract:
The recent emergence of approximable programs such as deep neural networks, big data analytics, media/video processing, and simulation applications has sparked a general interest in using approximation techniques to exploit the trade-off between energy and tolerable accuracy. Among these techniques, precision reduction has been widely adopted due to the ubiquity of real number arithmetic in programs. Precision is reduced by shrinking the number of bits used to represent numbers on a given hardware platform, ranging from low-bitwidth fixed point formats on FPGAs to 8-bit, half, single, and double precision floating point formats on CPUs, GPUs, and new customized accelerators. The work in this thesis focuses on the central problem of exploiting these number formats on various hardware platforms to deliver tolerable output accuracy while speeding up the target programs. The main platforms of focus are embedded systems and the emerging parallel architecture of GPUs.
For embedded systems, we introduce a set of methods for precisely allocating floating point and fixed point bitwidths to each variable in a program given an accuracy threshold. Following the principle that the lower the bitwidth, the lower the energy consumed, our algorithm minimizes the number of bits in each representation down to single-bit granularity. As an extension to this work, we studied the real-world use case of customized hardware designed to support fixed-point precision for energy-hungry machine learning applications. Our extension analyzes compute-intensive deep neural networks down to the bit-layer level of granularity to conserve as much energy as possible, and it can adaptively allocate bit precision to each layer according to hardware-design constraints such as budget or memory bandwidth.
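As a rough illustration of the kind of per-variable decision such an allocator makes, the sketch below quantizes one value onto a fixed-point grid and greedily shrinks its fractional bitwidth while a given error threshold still holds. The function names and the single-variable error model are illustrative assumptions, not the thesis's actual algorithm.

// Illustrative sketch only (not the thesis's algorithm): quantize a single
// value to a fixed-point grid and greedily shrink its fractional bitwidth
// while the quantization error stays under a per-variable threshold.
#include <cmath>
#include <cstdio>

double quantize_fixed(double x, int frac_bits) {
    double scale = std::ldexp(1.0, frac_bits);   // 2^frac_bits
    return std::round(x * scale) / scale;        // nearest representable value
}

int main() {
    const double x = 3.14159265;        // hypothetical program variable
    const double threshold = 1e-3;      // tolerable error for this variable

    int frac_bits = 24;                 // start from a generous bitwidth
    // Drop one fractional bit at a time while accuracy is still tolerable.
    while (frac_bits > 1 &&
           std::fabs(quantize_fixed(x, frac_bits - 1) - x) <= threshold) {
        --frac_bits;
    }
    std::printf("fractional bits kept: %d, error: %.3g\n",
                frac_bits, std::fabs(quantize_fixed(x, frac_bits) - x));
    return 0;
}

In the real setting the error threshold applies to the program output rather than to each variable in isolation, which is what makes the allocation problem non-trivial.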
To adopt low precision number formats on GPU architectures, we first introduce code rewriting techniques that exploit the half precision number format in modern GPUs, paving the way for applying mixed precision to CUDA programs. Building on these techniques, we provide a framework for mixed precision tuning of CUDA code that can make use of the half precision datatype. However, due to the limited range of half precision, naively using it in conventional CUDA programs may degrade performance and produce unusable output. To overcome this limitation, we introduce a set of techniques that enable low-error approximation, yielding higher accuracy and faster programs for applications that cannot take advantage of conventional half precision arithmetic.
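For concreteness, the sketch below shows the flavor of rewrite such a framework targets: the same element-wise kernel written once in single precision and once with the __half datatype from cuda_fp16.h. The kernel is a hypothetical example, not code taken from the thesis; the comment on dynamic range hints at why naive conversion can produce unusable output.

// Hypothetical example of a half-precision rewrite; not code from the thesis.
#include <cuda_fp16.h>

// Original single-precision element-wise kernel.
__global__ void saxpy_fp32(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Half-precision rewrite: data is stored and computed as __half.
// __half represents magnitudes only up to about 65504, so intermediates
// that overflow this range are exactly where naive conversion breaks down.
__global__ void saxpy_fp16(int n, __half a, const __half *x, __half *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = __hadd(__hmul(a, x[i]), y[i]);
}

Compiled for a GPU with native half arithmetic (e.g. nvcc -arch=sm_70), the half version halves the memory traffic per element, which is typically where the speedup on bandwidth-bound kernels comes from.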
By developing and testing these techniques on a variety of applications across multiple architectures, the work in this thesis answers open questions in the literature: how should low precision number formats be used efficiently on emerging architectures, and what is their actual impact on the error and performance of real applications? With the recent trend of realizing low precision number formats in hardware accelerators, we believe the studies in this thesis contribute to the bigger picture of exploiting these ubiquitous number formats in the near future.