Accelerating the Evaluation of Large Workloads on Post-Dennard Systems with Sampling
COM3 Level 1
SR13, COM3 01-22
Abstract:
As traditional Moore's Law-driven performance gains have plateaued with the end of Dennard scaling, computer architects have adopted novel design strategies to further improve performance. This marked a radical shift in the design of next-generation computing systems, including multi-core processors, accelerators, and heterogeneous systems. Evaluating the performance of complex, realistic workloads running on these systems poses unique challenges, particularly due to the long simulation times. Sampling serves as a promising solution by intelligently selecting representative subsets of a workload for performance evaluation. In this thesis, we explore novel methodologies to evaluate the performance of post-Dennard systems in a fast and efficient way using sampling.
To address these challenges, we first propose LoopPoint -- a sampled simulation methodology that applies to general-purpose multi-threaded workloads. LoopPoint uses application loops to demarcate regions that represent the amount of work done. We demonstrate that LoopPoint reduces the simulation time of large multi-threaded workloads from a few years to a few hours. In a follow-up work, Viper, we make use of the hierarchical structure of program execution to select regions of finer granularity suitable for RTL simulations. We show that naive adaptations of SimPoint or LoopPoint may not result in an optimal sample, as the application periodicity and phases vary among workloads.
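The general idea behind sampled simulation can be illustrated with a small sketch. This is a generic clustering-based illustration, not the LoopPoint or Viper implementation: the `Region` type, the per-region `features` signature, and the helper names are all hypothetical. Regions with similar signatures are grouped, one representative per group is simulated in detail, and its result is scaled by the share of work the group covers.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Region:
    features: tuple  # hypothetical per-region signature (assumed, e.g. normalized code-block counts)
    work: float      # amount of work in the region (assumed unit: instructions)


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; the first k points are used as seeds."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def select_representatives(regions, k):
    """Return (region index, weight) pairs, one per non-empty cluster."""
    labels, centroids = kmeans([r.features for r in regions], k)
    picks = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if not members:
            continue
        # The representative is the member closest to the cluster centroid.
        rep = min(members, key=lambda i: dist(regions[i].features, centroids[c]))
        # Weight: how many times the representative's work the cluster covers.
        weight = sum(regions[i].work for i in members) / regions[rep].work
        picks.append((rep, weight))
    return picks


def extrapolate(picks, simulated_time):
    """Estimate whole-program time from the representatives' simulated times."""
    return sum(simulated_time[rep] * weight for rep, weight in picks)
```

With four regions falling into two clusters, only two regions would need detailed simulation; `extrapolate` then scales their simulated times by the cluster weights to estimate the full run.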
Modern architectures often incorporate complex dynamic optimization techniques to improve system performance at runtime. However, prior sampled simulation methodologies are incapable of handling the dynamic nature of software and hardware. On this front, we propose Pac-Sim, which can be used to evaluate dynamically optimized software and hardware. Pac-Sim performs online analysis and relies on a real-time predictor to decide which regions are to be simulated in detail. This allows Pac-Sim to accurately evaluate dynamically scheduled applications, accounting for any runtime performance variability.
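As a rough illustration of deciding online which regions to simulate in detail, consider the sketch below. It is an assumption-laden toy, not Pac-Sim's actual predictor: the signature format, the distance `threshold`, and the `simulate_detailed` callback are all hypothetical. Regions arrive in execution order; a region whose signature is close to one already simulated reuses that measurement, and anything unfamiliar triggers detailed simulation.

```python
from math import dist


def run_online(stream, simulate_detailed, threshold=0.1):
    """Process (signature, instructions) pairs in execution order.

    simulate_detailed is a hypothetical callback that returns the measured
    cycles-per-instruction (CPI) of a region simulated in detail.
    Returns (estimated total cycles, number of detailed simulations).
    """
    known = []           # (signature, measured CPI) of regions simulated so far
    total_cycles = 0.0
    detailed = 0
    for signature, instructions in stream:
        # Look for a previously simulated region with a similar signature.
        match = next(
            (cpi for sig, cpi in known if dist(sig, signature) <= threshold),
            None,
        )
        if match is None:
            # Unfamiliar behavior: simulate in detail and remember the result.
            cpi = simulate_detailed(signature, instructions)
            known.append((signature, cpi))
            detailed += 1
        else:
            # Familiar behavior: reuse the earlier measurement and skip ahead.
            cpi = match
        total_cycles += cpi * instructions
    return total_cycles, detailed
```

Because the decision is made as the workload runs, no up-front profiling pass is needed. This toy reuses a stored measurement indefinitely, so a practical predictor would additionally need some way to notice when a familiar-looking region starts to perform differently.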
The increasing computational demand posed by high-performance computing (HPC) and artificial intelligence (AI) workloads is driving the shift toward heterogeneous architectures. Simulation of future heterogeneous systems is essential in understanding the interactions between compute components, but full-program simulations are prohibitively time-consuming and resource-intensive. We propose XPU-Point, a novel methodology that selects representative regions of heterogeneous CPU-GPU workloads for fast, accurate sampled simulations. XPU-Point significantly speeds up the simulation of HPC and AI workloads without compromising accuracy.
To summarize, we show that simulation alone is insufficient because of the significant slowdown it incurs, and that sampling is an efficient technique to render the simulation of large workloads tractable. We evaluate a variety of multi-core and heterogeneous workloads to develop methodologies that accelerate the performance evaluation and design space exploration of novel architectures.
Bio:
Alen is a sixth-year PhD candidate in computer science at the National University of Singapore. His research interests lie broadly in the areas of computer architecture, performance measurements, and simulation methodologies. Most of his prior research focused on building fast and accurate simulation methodologies, performance evaluation tools, and workload reduction techniques targeting multi-core CPU systems and heterogeneous CPU-GPU systems.