Concurrent Graph Query Processing on Many-Core Architectures
Abstract:
Graphs are the de facto data structures in various applications such as social network analysis, bioinformatics, online transaction analysis, and weblink analysis. Witnessing the emergence of modern commodity machines with massively parallel many-core processors, including CPUs and GPUs, researchers and practitioners have made substantial efforts on efficient parallel graph frameworks and systems. However, we notice that fundamental algorithms and primitives provided by these systems and frameworks are insufficient to support concurrent graph queries (CGQs) of emerging applications in many fields such as routing autonomous vehicles, network community profile analysis, and machine learning on graphs. The unique feature of these applications is that they launch many independent CGQs from different source vertices on the same graph. In this thesis, we focus on accelerating the CGQ processing, which, in practice, takes an overwhelming majority of the execution time (>90%) in CGQ applications. We identify three major issues that challenge the efficiency of CGQ processing. First, executing many CGQs simultaneously can require a significant amount of memory and computational resources. Second, executing CGQs by taking advantage of the inter-query parallelism among them could suffer severe contention in hardware resources. Third, the CGQ processing can perform substantially repetitive operations since the queries are homogeneous and executed on the same graph.
We present the design and implementation of graph processing systems for efficient CGQ processing, aiming at high-performance on many-core architectures (GPUs and CPUs). Particularly, first, we present a framework on GPUs, Vine, that addresses the first challenge, i.e., how to deal with a large number of CGQs that are subproblems in finding the exact solutions for constrained shortest path (CSP) problem, which require a significant amount of memory and computational resources. We use GPU accelerations because we find that even with CPU parallelization, the costly subproblem solving overhead is still not able to achieve practical performance for CSP. We develop a two-level pruning approach to eliminate the subproblems by making good use of the GPU's hierarchical memory and propose an adaptive parallelism control model based on the observations that the degree of parallelism is the key to performance optimization with the given computational resources. Our experiments show that Vine has over 5x speedup compared with a GPU baseline. Compared to the state-of-the-art approximate solution with preprocessed indices, Vine provides exact results with competitive or even better performance.
Second, we present a general framework, ForkGraph, that leverages the spatial sharing of the cache among CGQs on multi-core CPUs. To mitigate the high cache miss ratio due to contention among CGQs, graph operations are processed in a coordinated manner by buffering the memory accesses based on the cached partition in last-level cache. Specifically, we divide a graph into partitions that fit into the LLC of multi-core CPUs. The CGQs are buffered and executed on the partition basis. We further develop efficient intra- and inter-partition execution strategies for efficiency. For intra-partition processing, since the graph partition fits into LLC, we propose to execute each graph query with efficient sequential algorithms (in contrast with parallel algorithms in existing parallel graph processing systems) and present an atomic-free query processing method by consolidating contending operations to cache-resident graph partition. For inter-partition processing, we propose two designs, yielding and priority-based scheduling, to reduce redundant work in processing. We theoretically prove that ForkGraph performs the same amount of work, to within a constant factor, as the fastest known sequential algorithms in CGQ processing, which is work efficient. Our evaluations on real-world graphs show that ForkGraph significantly outperforms state-of-the-art graph processing systems with two orders of magnitude speedups.
Third, we further leverage memoization on graphs to exploit the computation sharing among CGQs by facilitating the processing of some queries using the results of other processed queries. We then present KGraph, a general framework to execute CGQs efficiently, with higher performance than ForkGraph on most of the CGQ applications. Since directly applying memoization hardly scales up on large graphs due to the high overhead involved and a low workload reduction ratio, we propose two techniques in KGraph to improve the efficiency of memoization on graphs. First, we develop a fine-grained approach by partitioning the graph with a cost model to determine the granularity of graph partitioning. Specifically, we only memoize query results and apply memoization within each partition to reduce the overhead involved. Second, we propose to pre-select pivot queries to memoize since the results of these queries could have high contributions to CGQs. Instead of exploiting the sharing among CGQs, we leverage the sharing opportunities between them and the memoized pivot queries for high memoization effectiveness. We perform a comprehensive analysis of KGraph's performance using five popular CGQ applications. The experiments show that our system achieves an average speedup of 2.4x over our previous work.