PH.D DEFENCE - PUBLIC SEMINAR

Design Space Exploration Techniques for FPGA-based Accelerators

Speaker
Mr Zhong Guanwen
Advisor
Dr Tulika Mitra, Provost's Chair Professor, School of Computing


26 Oct 2017 Thursday, 03:00 PM to 04:30 PM

Executive Classroom, COM2-04-02

Abstract:

The increasing complexity of FPGA-based accelerators, coupled with time-to-market pressure, makes high-level synthesis (HLS) an attractive way to improve designer productivity by raising the programming abstraction above the register-transfer level (RTL). HLS offers various architectural design options with different trade-offs via pragmas. However, the non-negligible runtime of HLS renders exhaustive HLS-based architectural exploration, whether manual or automated, practically infeasible. Moreover, applications containing compute-intensive kernels can effectively leverage FPGAs to exploit fine- and coarse-grained parallelism. HLS tools, however, are inefficient at identifying and exploiting multiple levels of parallelism, and thereby produce sub-optimal accelerators. To address these challenges, this dissertation focuses on developing effective and efficient HLS- and estimator-based design space exploration (DSE) techniques for FPGA-based accelerators.

First, we consider kernels containing multiple loops with or without data dependencies and propose an HLS-based DSE technique that prunes the design space (loop unrolling and dataflow pragmas) and reduces the number of HLS invocations. Experiments on various scientific kernels demonstrate that our DSE technique is accurate and efficient.
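
As a rough illustration of why pruning matters, the sketch below counts candidate configurations for a two-loop kernel with and without a simple pruning rule. The rule used here (keep only unroll factors that divide the loop trip count) is a common heuristic chosen for illustration and is an assumption; it is not necessarily the pruning strategy of the dissertation. The function names and trip counts are hypothetical.

```python
from itertools import product


def divisors(n):
    """Return all positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def candidate_configs(trip_counts, prune=True):
    """Yield (unroll factors per loop, dataflow flag) configurations.

    With prune=True, only unroll factors that evenly divide the trip
    count are kept (an illustrative heuristic, not the thesis method).
    """
    per_loop = [
        divisors(t) if prune else list(range(1, t + 1))
        for t in trip_counts
    ]
    for factors in product(*per_loop):
        for dataflow in (False, True):
            yield factors, dataflow


# Hypothetical kernel with two loops of 16 and 12 iterations.
trip_counts = [16, 12]
full = sum(1 for _ in candidate_configs(trip_counts, prune=False))
pruned = sum(1 for _ in candidate_configs(trip_counts, prune=True))
print(full, pruned)  # 384 60
```

Each remaining configuration would still cost one HLS run, so even this simple rule cuts the number of invocations from 384 to 60 in the toy case.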

For a more complex design space (e.g., loop unrolling, loop pipelining and array partitioning), the runtime of HLS tools becomes highly variable, ranging from seconds to hours, which pushes the exploration time of HLS-based DSE approaches to the order of hours. Hence, we propose an estimator-based technique, Lin-Analyzer, to predict the performance of a kernel as a hardware accelerator on an FPGA under different combinations of pragmas without actually invoking HLS. This allows Lin-Analyzer to perform rapid architectural exploration with various pragmas for FPGA-based accelerators. Experimental results confirm that Lin-Analyzer can perform DSE on the order of seconds or minutes.

The works above consider only kernels with small datasets. A kernel whose dataset exceeds the FPGA's on-chip storage must be tiled into smaller blocks so that the dataset of each tile can be accommodated on the FPGA. Multiple tiles can be executed in parallel by instantiating several processing engines. The design space grows even more complex once the tile size and the number of processing engines within the resource budget are added as parameters alongside the diverse HLS pragmas. Moreover, the previous works lack accurate area estimation models for such a complex design space. Hence, we develop a machine-learning-based area estimation model and propose a rapid estimation framework, MPSeeker, to evaluate performance/area metrics of various accelerator options without HLS. MPSeeker can rapidly explore a complex design space on the order of minutes, while identifying a near-optimal combination of pragma settings.
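
The interaction between tile size, number of processing engines and resource budget can be sketched with a toy model. Everything in this example is an assumption made for illustration: the linear buffer cost, the fixed per-tile overhead, the latency formula and all parameter names and values are hypothetical, and the sketch does not reproduce MPSeeker's estimation models.

```python
import math


def explore(dataset_size, tile_sizes, max_pes, bram_budget,
            cycles_per_word=1, tile_overhead=10):
    """Return (latency, tile_size, num_pes) with the lowest estimated latency.

    Toy assumptions: each processing engine buffers one tile (one memory
    unit per word), and processing a tile costs
    tile * cycles_per_word + tile_overhead cycles.
    """
    best = None
    for tile in tile_sizes:
        for pes in range(1, max_pes + 1):
            if tile * pes > bram_budget:
                continue  # configuration does not fit on chip
            num_tiles = math.ceil(dataset_size / tile)
            rounds = math.ceil(num_tiles / pes)  # tiles run `pes` at a time
            latency = rounds * (tile * cycles_per_word + tile_overhead)
            if best is None or latency < best[0]:
                best = (latency, tile, pes)
    return best


# Hypothetical problem: 4096 words, at most 8 engines, 1024 buffer units.
result = explore(4096, [64, 128, 256, 512], max_pes=8, bram_budget=1024)
print(result)  # (552, 128, 8)
```

In this toy setting, small tiles pay the per-tile overhead more often, while large tiles leave room for fewer engines within the budget, so the best point lies in between.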

The above works provide an easy translation path towards accelerating kernels on heterogeneous computing systems featuring FPGAs. To demonstrate the importance of DSE for FPGA-based accelerators, and to show how such accelerators can work seamlessly and efficiently with the other on-chip computing elements of a heterogeneous system, we present a case study on accelerating convolutional neural network (CNN) applications. More specifically, we propose an automated hardware/software co-designed CNN inference framework, Synergy, on a Xilinx Zynq architecture that leverages both the FPGA and the CPUs through multi-threading. Our results show that Synergy delivers better throughput and energy efficiency than contemporary CNN implementations on the same platform.