Efficient Time-Energy Execution of Data-Parallel Applications on Heterogeneous Systems with GPU
Venue: Executive Classroom, COM2-04-02 (COM2 Level 4)
Abstract:
The last decade has seen the exponential growth of data and the advent of data-parallel processing frameworks such as Google's Cloud Dataflow and MapReduce. At the same time, hardware systems have entered the heterogeneity era, in which multiple processing units with different performance-to-power ratios are combined into a single system. Meanwhile, low-power (wimpy) systems traditionally used in mobile devices have made significant improvements in performance and are targeting the server market dominated by high-performance (brawny) x86-64 systems. In this context, it is important to study the efficiency of running data-parallel applications on heterogeneous systems.
In this thesis, we propose techniques for the efficient execution of data-parallel processing on heterogeneous systems with GPUs. Our lazy processing technique enables the parallel processing of multiple input records on the GPU, in contrast to chunking a single record among GPU threads. At runtime, our one-time dynamic mapping technique selects the better execution unit, between the CPU and the GPU, for data-parallel processing. This approach is implemented in MoSS, a Hadoop-CUDA framework that we have developed. Compared to Hadoop, MoSS reduces execution time by a factor of up to 2.3 on brawny systems and 3.1 on wimpy systems, with a maximum energy reduction of 80% for compute-intensive workloads. On average, MoSS is over 50% faster than the chunking approach.
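To illustrate the contrast between lazy processing and chunking, the following minimal CUDA sketch maps one complete input record per GPU thread, so an entire batch of records is processed in parallel. The flat buffer with per-record offsets, the kernel name map_records, and the delimiter-counting map function are illustrative assumptions, not the actual MoSS implementation.

// Record-parallel (lazy) processing sketch: one thread maps one complete record,
// instead of splitting a single record among all GPU threads (chunking).
// Assumes offsets has num_records + 1 entries delimiting each record in buffer.
__global__ void map_records(const char *buffer, const int *offsets,
                            int num_records, int *out)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per record
    if (r >= num_records) return;
    int count = 0;
    for (int i = offsets[r]; i < offsets[r + 1]; ++i)
        if (buffer[i] == ' ') ++count;               // toy map: count delimiters
    out[r] = count;
}
// Launch with enough threads to cover the batch, e.g.:
// map_records<<<(num_records + 255) / 256, 256>>>(d_buf, d_off, num_records, d_out);

Under the same sketch, the one-time dynamic mapping would time a small sample of records on both the CPU and the GPU at the start of a job and commit the remaining records to whichever unit finished faster.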
Secondly, we perform a measurement-driven analysis of MapReduce on intra-node heterogeneous systems with (i) ARM big.LITTLE CPUs and (ii) discrete and integrated GPUs. Our analysis of ARM big.LITTLE systems shows that there is no one-size-fits-all rule for efficient data-parallel processing on these systems. However, small memory size, low memory and I/O bandwidth, and software immaturity combine to cancel the low-power advantage of ARM systems. Our analysis of heterogeneous systems with both discrete and integrated GPUs reveals that wimpy systems with integrated GPUs consume the least energy, owing to more energy-efficient hardware and better-balanced system resources. Based on this finding, we establish an equivalence ratio between a single brawny heterogeneous node and multiple wimpy heterogeneous nodes. We show that multiple wimpy nodes achieve the same time performance as a single brawny node while saving up to two-thirds of the energy.
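As a sketch of how such an equivalence ratio can be expressed (the symbols below are illustrative, not the thesis notation): if a single brawny node completes a workload in time $T_b$ at average power $P_b$, and a single wimpy node needs time $T_w$ at average power $P_w$, then, assuming the wimpy cluster scales linearly, roughly $N \approx \lceil T_w / T_b \rceil$ wimpy nodes match the brawny node's execution time, and the corresponding energy ratio is approximately $\frac{N \, P_w \, T_b}{P_b \, T_b} = \frac{N P_w}{P_b}$. The two-thirds energy saving reported above corresponds to this ratio being around one-third.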
Thirdly, we design measurement-driven time-energy analytic models to determine the execution time and energy usage of data-parallel execution on both homogeneous systems running Hadoop and heterogeneous systems running MoSS. To the best of our knowledge, we are the first to design an energy usage model for MapReduce execution. Because our modeling approach uses baseline measurements to increase model accuracy, validation on up to 264 system configurations shows an average model error of less than 15%. Using our models, we analyze the performance of hypothetical scale-out clusters with more than 100 nodes. This analysis shows that heterogeneity always achieves better time-energy performance when the workload includes a compute-intensive part. In line with our measurement-driven analysis, our models show that multiple wimpy nodes not only achieve execution times similar to brawny nodes, but also deliver energy savings of up to 90% for compute-intensive workloads. These results, together with the MoSS performance results, advocate the use of wimpy systems with integrated GPUs for data-parallel processing.
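A hedged sketch of the general shape of such baseline-driven models (the phase decomposition and the symbols $D_p$, $B_p$, $P_p$ are illustrative, not the exact thesis formulation):

$T_{\text{job}} \approx \sum_{p \in \text{phases}} \frac{D_p}{B_p}, \qquad E_{\text{job}} \approx \sum_{p \in \text{phases}} P_p \cdot \frac{D_p}{B_p},$

where $D_p$ is the amount of data processed in phase $p$, and $B_p$ and $P_p$ are the throughput and average power measured as baselines for that phase on the target system. Anchoring the models in such measured baselines is what keeps the average model error below 15% across the validated configurations.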