PH.D. DEFENCE - PUBLIC SEMINAR

Parallel Graph Processing Accelerators on FPGAs

Speaker
Mr Chen Xinyu
Advisor
Dr He Bingsheng, Professor, School of Computing


24 Mar 2022 Thursday, 02:00 PM to 03:30 PM

Zoom presentation

Abstract:

Due to the breakdown of Dennard scaling and the emergence of dark silicon, successive CPU generations exhibit diminishing performance returns.
By allowing designers to customize application-specific hardware logic, field-programmable gate arrays (FPGAs) deliver high performance and energy efficiency while remaining reconfigurable. Hence, there is a surge of interest in adopting FPGA-based accelerators for various applications.
Moreover, compared to hardware description languages (HDLs), high-level synthesis (HLS) improves programmability and usability for FPGA-based designs.

Graphs are the de facto data structures for representing relationships between entities in many emerging big-data applications, e.g., data science and machine learning. The exponential growth of data from these applications has created a pressing demand for high-performance graph processing.
Consequently, graph processing systems have become a hot research topic in both academia and industry. Despite a wealth of existing efforts on developing graph processing systems to improve performance and/or energy efficiency on traditional architectures, graph processing accelerators are emerging as essential because they provide benefits significantly beyond what pure software solutions can offer. This thesis explores performance and energy-efficiency optimizations for FPGA-based customized computing for graph processing.

First, current FPGA-based graph processing accelerators have explored a number of techniques, such as caching with Block RAMs (BRAMs) to reduce random accesses and multiple processing elements (PEs) for high throughput. However, many of these techniques are challenging to realize in HLS-based FPGA designs because the runtime data dependencies introduced by multiple PEs are poorly handled by HLS's coarse control granularity. We solve this problem with a novel on-the-fly parallel data shuffling technique. We further integrate the shuffling technique into an edge-centric graph processing framework, which achieves more than 1,000 million traversed edges per second (MTEPS) on PageRank, SpMV, BFS, and SSSP, and is as efficient as, and at times even better than, existing HDL-based designs.
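
To make the idea of on-the-fly data shuffling concrete, the minimal C++ sketch below models its effect in software: edge-generated updates are routed to the processing element that owns the destination vertex, so no two PEs ever write the same vertex and the runtime data dependency disappears. The data layout, partitioning rule, and loop structure are illustrative assumptions; the actual technique is realized as parallel HLS hardware pipelines, not sequential loops.

    // Software model of on-the-fly data shuffling (illustrative only).
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Update { uint32_t dst; float value; };  // an edge-generated update

    int main() {
        constexpr int kNumPEs = 4;        // number of processing elements
        constexpr int kNumVertices = 16;
        std::vector<float> vertexProp(kNumVertices, 0.0f);

        // Updates produced by streaming edges (edge-centric scatter phase).
        std::vector<Update> updates = {
            {3, 1.0f}, {7, 0.5f}, {3, 0.25f}, {12, 2.0f}, {5, 1.5f}};

        // Shuffle: route each update to the PE that owns its destination vertex,
        // so no two PEs ever write the same vertex (no runtime data dependency).
        std::vector<std::vector<Update>> peQueue(kNumPEs);
        for (const auto& u : updates)
            peQueue[u.dst % kNumPEs].push_back(u);

        // Gather/apply: each PE accumulates only its own partition; in hardware
        // these loops would run as independent pipelines in parallel.
        for (int pe = 0; pe < kNumPEs; ++pe)
            for (const auto& u : peQueue[pe])
                vertexProp[u.dst] += u.value;  // SpMV/PageRank-style accumulate

        for (int v = 0; v < kNumVertices; ++v)
            if (vertexProp[v] != 0.0f)
                std::cout << "v" << v << " = " << vertexProp[v] << "\n";
        return 0;
    }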

Second, although many works have been proposed to design efficient FPGA-based accelerators for graph processing, programmability remains largely overlooked: building such accelerators still requires hardware design expertise and sizable development effort from developers. To close this gap, we design and implement ThunderGP, an open-source HLS-based graph processing framework on FPGAs.
ThunderGP enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability.
We evaluate ThunderGP with seven common graph applications. The results show that the generated accelerators, running at 250 MHz on real hardware platforms, deliver a 2.9x speedup over the state-of-the-art approach and achieve throughput of up to 6,400 MTEPS.
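
To illustrate the level of programmability targeted, the sketch below shows the kind of application-level functions a developer might supply to an HLS-based framework for a PageRank-like workload, while the framework takes care of memory access, scheduling, and PE replication. The function names and signatures are hypothetical, chosen for illustration, and do not reproduce ThunderGP's exact API.

    // Hypothetical user-level code for a PageRank-like application on an
    // HLS-based graph framework; names and signatures are illustrative.
    #include <cstdint>

    typedef float prop_t;  // vertex property type chosen by the developer

    // Scatter: the value an edge source contributes to its destination.
    inline prop_t scatterFunc(prop_t srcProp, uint32_t outDegree) {
        return srcProp / outDegree;
    }

    // Gather: how contributions arriving at the same destination combine.
    inline prop_t gatherFunc(prop_t accumulated, prop_t incoming) {
        return accumulated + incoming;
    }

    // Apply: per-vertex update at the end of an iteration (damping for PageRank).
    inline prop_t applyFunc(prop_t accumulated, prop_t /*oldProp*/) {
        const prop_t kDamping = 0.85f;
        return (1.0f - kDamping) + kDamping * accumulated;
    }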

Third, recent memory subsystem upgrades, including the introduction of high-bandwidth memory (HBM) in FPGAs, promise to further alleviate the memory bottlenecks of graph processing. However, modern multi-channel HBM requires far more computational capacity to fully utilize its bandwidth potential, and existing designs underutilize the HBM even when all other resources are fully consumed. We propose a resource-efficient heterogeneous pipeline architecture that scales graph processing on HBM-enabled FPGAs. Each type of pipeline is tailored to a specific memory access pattern within graph processing and is therefore more lightweight, allowing the architecture to scale up more effectively with limited resources.
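
The C++ sketch below is a software analogy, under assumed names and data layouts, of why splitting the datapath into heterogeneous pipelines helps: one lightweight pipeline only streams the edge list sequentially (burst-friendly), while a separate pipeline only services the irregular vertex reads. In the accelerator, each would be a small dedicated HLS pipeline bound to its own HBM pseudo-channel rather than a software function.

    // Software analogy of heterogeneous pipelines (illustrative only).
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct Edge { uint32_t src, dst; };

    // Sequential-access pipeline: burst-friendly streaming of the edge list
    // from its own memory channel.
    void streamEdges(const std::vector<Edge>& edgeChannel,
                     std::vector<uint32_t>& requests) {
        for (const auto& e : edgeChannel)   // purely sequential reads
            requests.push_back(e.src);      // emit vertex-read requests
    }

    // Random-access pipeline: services irregular reads of vertex properties
    // from a separate channel.
    void gatherVertices(const std::vector<float>& vertexChannel,
                        const std::vector<uint32_t>& requests,
                        std::vector<float>& results) {
        for (uint32_t v : requests)         // irregular (random) reads
            results.push_back(vertexChannel[v]);
    }

    int main() {
        std::vector<Edge> edges = {{0, 1}, {2, 3}, {1, 3}};
        std::vector<float> vertices = {0.1f, 0.2f, 0.3f, 0.4f};

        std::vector<uint32_t> reqs;
        std::vector<float> vals;
        streamEdges(edges, reqs);
        gatherVertices(vertices, reqs, vals);

        for (float v : vals) std::cout << v << " ";
        std::cout << "\n";
        return 0;
    }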

Furthermore, we develop ReGraph, an open-source framework that automates the entire development process. ReGraph outperforms state-of-the-art FPGA accelerators by up to 5.9 times, and is up to 18 times more energy-efficient and 6.7 times faster than state-of-the-art GPU solutions.