XACC@NUS WORKSHOP SERIES 2020: RECONFIGURABLE COMPUTING SYSTEMS

- Data and Cloud
- Cloud Tea Time
- AI and Machine Learning

Speaker
Mr. Xinyu Chen, Ph.D. Student, NUS
Associate Professor Long Zheng, Huazhong University of Science and Technology (HUST)
Dr. Chengchen Hu, Principal Engineer, Xilinx Inc.
Dr. Zeke Wang, Zhejiang University
Mr. Li Jiashu, R&D Engineer, 4Paradigm

Contact Person
Dr HE Bingsheng, Professor, School of Computing
hebs@comp.nus.edu.sg

24 Jul 2020 Friday, 09:00 AM to 12:00 PM

via Zoom

Workshop Structure:
Each talk is 25 minutes long, followed by 5 minutes of Q&A.

Zoom link:
The link will be given after registration: https://docs.google.com/forms/d/e/1FAIpQLSfhk0JRJoW7aJuRQJ5Fhmqi01sMagw_uQd62TIIaJKkVhSFuA/viewform


9:05 - 9:35am ThunderGP: Fast Graph Processing for HLS-based FPGAs - Xinyu Chen
Abstract:
Graph processing has attracted a lot of attention in data analytics because graphs naturally represent the datasets of many applications, including social networks, cybersecurity, and machine learning. The exponential growth of data from these applications has created a pressing need for high-performance graph processing frameworks.
With its massive parallelism and energy efficiency, the FPGA is becoming attractive hardware for accelerating graph processing. In this talk, I will present ThunderGP, an efficient graph processing framework for HLS-based FPGAs, which enables data scientists to enjoy the performance of FPGA-based graph processing without compromising programmability. Two aspects allow ThunderGP to deliver superior performance. On the one hand, ThunderGP embraces an improved execution flow that better exploits the pipeline parallelism of the FPGA and reduces the volume of data accessed in global memory. On the other hand, its memory accesses are highly optimized to fully utilize the memory bandwidth of the hardware platform.
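As background for the execution flow discussed in the talk, the sketch below illustrates the scatter-gather-apply pattern commonly used by accelerator-oriented graph frameworks. It is a minimal software analogy only; the function and variable names are illustrative and are not ThunderGP's actual API.

```python
def scatter(vertices, edges):
    # Scatter: each edge reads its source vertex value and emits an update
    return [(dst, vertices[src]) for src, dst in edges]

def gather(updates, n):
    # Gather: accumulate updates per destination vertex
    acc = [0] * n
    for dst, val in updates:
        acc[dst] += val
    return acc

def apply_phase(vertices, acc):
    # Apply: merge accumulated values into the vertex properties
    return [v + a for v, a in zip(vertices, acc)]

# Tiny example: 3 vertices with edges 0->1, 0->2, 1->2
vertices = [1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2)]
result = apply_phase(vertices, gather(scatter(vertices, edges), len(vertices)))
# result == [1, 3, 6]
```

On an FPGA, each of these three phases can run as a deeply pipelined kernel streaming over edges, which is why reducing global-memory traffic between phases matters so much for performance.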

Biodata:
Xinyu Chen is a third-year Ph.D. student at NUS working with Prof. Bingsheng He and Prof. Weng-Fai Wong. His research interests include FPGA-based hardware accelerator, hardware-software co-design, and database systems.


9:35 - 10:05am A High-Performance Graph Processing Accelerator - Long Zheng
Abstract:
Graph processing is widely used in many real-world scenarios, yet graph processing applications are notoriously challenging to accelerate due to their well-known random memory access patterns. In this talk, I will introduce an ambitious project in China to build a graph processing computer, funded by the National Key Research and Development Program of China. We characterize typical memory behaviors of graph processing and rethink the architectural design to improve memory-efficient parallelism, particularly in terms of edge parallelization. I will also introduce some industrial use cases, such as a financial anti-fraud application from Ping An Technology Co., Ltd. and power-grid state estimation from the State Grid Corporation of China. Our graph accelerator boosts their performance significantly.

Biodata:
Long Zheng is an associate professor at Huazhong University of Science and Technology (HUST). He received his Ph.D. degree in computer architecture from HUST in 2016. Long has published over 30 research papers in prestigious conferences and journals, including USENIX ATC, PACT, CGO, ICDCS, IPDPS, ACM TACO, and TPDS. He was a Best Paper Candidate at PACT 2018 and received the Best Presentation Award at CGO 2015. He has served as a guest editor of CCF Transactions on High-Performance Computing. His current research interests include reconfigurable architectures, runtime systems, and parallel programming.


10:05 - 10:35am Towards Distributed Adaptive Computing - Chengchen Hu
Abstract:
Distributed computing applications are the major driver of the cloud's evolution. Their computing infrastructures have existed in a loosely coupled, CPU-centric style for several decades, but performance improvements are now declining due to the slowing of Moore's law, explosive data growth, and unprecedented computational demands. Increasingly, offloading computation and disaggregating resources are becoming the way forward, leveraging specialized network hardware and software to provide reliable, predictable, efficient, and high-performance infrastructure. Distributed Adaptive Computing (DAC) extends the existing Xilinx per-server adaptive approach by introducing custom hardware and software interconnection technologies between Xilinx components, providing uniquely flexible networked solutions to big data challenges that scale to ever-increasing demands. In this presentation, the speaker will report current research towards DAC in two aspects. The first extends the NIC taxonomy to the "Adaptable NIC", in which standard or proprietary protocol processing can be programmed by users to match application needs. The second pushes adaptivity further into network switches, with a new system architecture called the "Adaptable Switch".

Biodata:
Chengchen Hu is a Principal Engineer at Xilinx Inc. and the founding director of Xilinx Labs Asia Pacific, based in Singapore, where he currently leads research on networked processing systems. Prior to joining Xilinx in August 2017, he was a Professor and Department Head of the Department of Computer Science and Technology at Xi'an Jiaotong University, P. R. China. He received his Ph.D. degree in Computer Science from Tsinghua University and is a recipient of the New Century Excellent Talents in University award from the Ministry of Education, China, a fellowship from the European Research Consortium for Informatics and Mathematics (ERCIM), and a Microsoft "Star-Track" Young Faculty fellowship. He has served on the organization and technical program committees of many conferences, including INFOCOM, IWQoS, ICC, GLOBECOM, ANCS, Networking, and APCC. His main research theme is monitoring, diagnosing, and managing the Internet, cloud data center networks, and distributed systems through hardware-optimized and software-defined approaches.


10:35 - 10:50am Q&A and Discussion - Bingsheng He


10:50 - 11:20am Accelerating Generalized Linear Models with MLWeaving: A One-Size-Fits-All System for Any-Precision Learning - Zeke Wang
Abstract:
Learning from data stored in a database is an important function increasingly available in relational engines. Methods using lower-precision input data are of special interest given their overall higher efficiency. However, in databases, these methods have a hidden cost: quantizing real values into smaller representations is an expensive step. To address this issue, we present MLWeaving, a data structure and hardware acceleration technique intended to speed up learning of generalized linear models over low-precision data. MLWeaving provides a compact in-memory representation that enables the retrieval of data at any level of precision. MLWeaving also provides a highly efficient implementation of stochastic gradient descent on FPGAs and enables dynamic tuning of precision, instead of using a fixed precision level during learning. Experimental results show that MLWeaving converges up to 16x faster than low-precision implementations of first-order methods on CPUs.
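As a rough intuition for how one in-memory representation can serve any precision level, the sketch below stores values as bit-planes, most-significant bit first, so reading only the first k planes yields k-bit approximations of every value. This is an illustrative simplification under assumed 8-bit integers, not MLWeaving's actual layout or API.

```python
def weave(values, bits=8):
    # Store bit-planes MSB-first: plane p holds one bit of every value,
    # starting from the most significant bit
    return [[(v >> p) & 1 for v in values] for p in range(bits - 1, -1, -1)]

def retrieve(planes, k):
    # Read only the first k bit-planes to reconstruct k-bit approximations
    n = len(planes[0])
    out = [0] * n
    for p in range(k):
        for i in range(n):
            out[i] = (out[i] << 1) | planes[p][i]
    # Shift back so approximations stay on the original value scale
    return [v << (len(planes) - k) for v in out]

vals = [200, 13, 90]
planes = weave(vals, bits=8)
full = retrieve(planes, 8)     # full precision: recovers [200, 13, 90]
approx4 = retrieve(planes, 4)  # 4-bit precision: [192, 0, 80]
```

The key property is that lowering precision changes only how many planes are read, not how the data is stored, which is what allows precision to be tuned dynamically during learning.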

Biodata:
Dr. Zeke Wang is a ZJU100 Young Professor at the Collaborative Innovation Center of Artificial Intelligence and the Department of Computer Science, Zhejiang University, China. He received his Ph.D. degree in Instrument Science & Technology from ZJU in 2011.


11:20 - 11:50am FlashTVM: Optimizing Deep Learning Computation on OpenCL-compatible Hardware Accelerators - Jiashu Li
Abstract:
TVM is an end-to-end deep learning compiler stack; it exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. The Versatile Tensor Accelerator (VTA) is an extension of the Apache (incubating) TVM framework that exposes a RISC-like programming abstraction to describe compute and memory operations at the tensor level. The original VTA core only works on selected Xilinx edge SoC FPGAs, and, limited by the hardware resources available, its performance is unable to support demanding applications. At 4Paradigm, we designed and implemented an interface framework that gives TVM-VTA the ability to utilize OpenCL-compatible hardware accelerators, including Intel's and Xilinx's high-performance datacenter FPGAs.
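To give a flavor of the tensor-level, RISC-like abstraction described above, the toy interpreter below runs a VTA-style instruction stream in which loads move tiles into on-chip buffers, a matrix-multiply instruction accumulates into a result buffer, and stores write back to off-chip memory. The instruction encoding and interpreter are purely illustrative assumptions, not VTA's real ISA or TVM's API.

```python
def run(program, dram):
    # sram models on-chip buffers; dram models off-chip memory
    sram = {}
    for op, *args in program:
        if op == "LOAD":      # (LOAD, dst_buf, dram_key)
            dst, key = args
            sram[dst] = dram[key]
        elif op == "GEMM":    # (GEMM, acc_buf, a_buf, b_buf): acc += A @ B
            acc, a, b = args
            A, B = sram[a], sram[b]
            C = sram.setdefault(acc, [[0] * len(B[0]) for _ in A])
            for i in range(len(A)):
                for j in range(len(B[0])):
                    C[i][j] += sum(A[i][k] * B[k][j] for k in range(len(B)))
        elif op == "STORE":   # (STORE, dram_key, src_buf)
            key, src = args
            dram[key] = sram[src]
    return dram

dram = {"A": [[1, 2], [3, 4]], "B": [[5, 6], [7, 8]]}
program = [("LOAD", "a", "A"), ("LOAD", "b", "B"),
           ("GEMM", "c", "a", "b"), ("STORE", "C", "c")]
out = run(program, dram)
# out["C"] == [[19, 22], [43, 50]]
```

In the real stack, TVM's compiler emits such instruction streams from high-level operator descriptions, which is what an OpenCL-compatible back-end must ultimately execute on the target FPGA.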

Biodata:
Li Jiashu is an R&D Engineer in the High Performance Computing Division at 4Paradigm, in charge of architecting and developing FPGA accelerators for AI applications. With more than eight years of industrial experience in RTL design and embedded systems, he actively explores innovative ideas to bring FPGAs into both data centers and edge devices. Li Jiashu received his M.Comp degree in Computer Science and B.Eng degree in Computer Engineering from the National University of Singapore, and he is a recipient of the Lee Kuan Yew Gold Medal.


About XACC@NUS (Xilinx Adaptive Compute Clusters at NUS): https://www.comp.nus.edu.sg/news/3371-2020-xacc-research-cluster/.
XACC@NUS will organize more talk series related to FPGAs in the future. Stay tuned.