Multithreaded Parallel Data Structures
We study the problem of designing and analyzing multithreaded parallel data structures for asynchronous multi-processor machines that have provable performance bounds and are both cost-scalable and composable. That is, we want programs using such data structures to attain nearly linear speedup when run on p processors as we increase p, and we want to be able to use these data structures easily as black boxes.
As our first key contribution toward this goal, we delineate and work within a high-level DMT (dynamic multithreading) model with a small but powerful set of thread primitives and standard RMW (read-modify-write) operations. We introduce a new 'span'-type notion called delay, and prove various properties showing that delay bounds compose more easily than span bounds. In particular, for any multithreaded program represented by a program DAG (directed acyclic graph) in which each node is either a basic instruction or a data structure call, we can bound the delay of the entire computation using just the program DAG and a delay bound for each data structure call, and ultimately bound the time taken by the entire computation when run on a sufficiently efficient scheduler. Specifically, we show that any DMT program with w work and s delay takes O(w/p+s) time when run on p processors using a greedy scheduler. Since the delay of a program never exceeds its span, this extends the much weaker well-known result that any DMT program with w work and s span takes O(w/p+s) time when run on p processors using a greedy scheduler.
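The classical span-based version of this bound can be checked concretely. The following Python sketch (all names here are ours, not from the thesis) simulates a greedy scheduler on a unit-cost program DAG and verifies the Graham-style bound; since delay is defined only in the thesis, the sketch uses the DAG's span as a conservative stand-in for s:

```python
from collections import defaultdict

def greedy_schedule(num_nodes, edges, p):
    """Simulate a greedy scheduler on a DAG of unit-cost nodes:
    at each time step, execute up to p ready nodes.
    Returns the number of time steps taken."""
    indeg = [0] * num_nodes
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in range(num_nodes) if indeg[v] == 0]
    steps = 0
    while ready:
        batch, ready = ready[:p], ready[p:]   # greedy: run as many ready nodes as possible
        steps += 1
        for u in batch:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

def dag_span(num_nodes, edges):
    """Span of the DAG: number of nodes on a longest path."""
    indeg = [0] * num_nodes
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    depth = [1] * num_nodes
    queue = [v for v in range(num_nodes) if indeg[v] == 0]
    while queue:
        u = queue.pop()
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return max(depth)

# Fan-out/fan-in DAG: 0 -> {1,2,3,4} -> 5 -> {6,7,8} -> 9
edges = ([(0, i) for i in (1, 2, 3, 4)] + [(i, 5) for i in (1, 2, 3, 4)]
         + [(5, i) for i in (6, 7, 8)] + [(i, 9) for i in (6, 7, 8)])
w, p = 10, 2
s = dag_span(10, edges)                      # span s = 5
steps = greedy_schedule(10, edges, p)        # 7 steps on 2 processors
assert steps <= (w - s) / p + s              # Graham-style bound: T <= (w-s)/p + s
```

On this example the scheduler takes 7 steps, within the bound (10-5)/2 + 5 = 7.5; the delay-based result strengthens this by replacing span with the (never larger) delay.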
Our second key contribution comprises basic building blocks implemented within the DMT model, which are extremely useful in designing other algorithms and data structures, and which perform very well even under a queued memory contention model. These building blocks include higher-level synchronization primitives, basic batch operations, parallel entropy-sorting, and the extended implicit batching technique (which requires a slightly stronger DMT+ model). The extended implicit batching technique extends the implicit batching technique introduced by Agrawal et al. The motivation behind implicit batching is that we want to easily obtain an efficient atomic data structure from a batched data structure; this would drastically simplify the design of efficient atomic data structures, since we could then just design efficient batched data structures that process operations in batches. The original technique supports implicit batching only of a batch-parallel data structure (i.e. one that supports only one batch of operations at a time), whereas the extended technique supports implicit batching of a much larger class of batched data structures, including pipelined data structures. Furthermore, the original technique supports only a single implicitly batched data structure, whereas the extended technique supports multiple implicitly batched data structures.
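The core idea of implicit batching, namely that callers issue operations as if the structure were atomic while a runtime transparently groups concurrent operations into batches, can be sketched loosely as follows. This is a flat-combining-style toy in Python with names of our own invention; the actual technique of Agrawal et al. piggybacks on the work-stealing scheduler rather than using locks:

```python
import threading

class BatchedCounter:
    """Toy *batched* data structure: applies one whole batch of
    increment operations at a time."""
    def __init__(self):
        self.counts = {}
    def process_batch(self, batch):
        for key in batch:   # a real batched structure would process the batch in parallel
            self.counts[key] = self.counts.get(key, 0) + 1

class ImplicitBatcher:
    """Sketch of implicit batching: each caller posts its operation, then
    becomes (or waits for) the combiner, which feeds everything posted so
    far to the batched structure as a single batch. While one batch runs,
    other callers' posts accumulate into the next batch."""
    def __init__(self, ds):
        self.ds = ds
        self.post_lock = threading.Lock()     # guards the pending buffer
        self.combine_lock = threading.Lock()  # at most one combiner at a time
        self.pending = []

    def apply(self, key):
        with self.post_lock:
            self.pending.append(key)
        with self.combine_lock:               # serialize batches
            with self.post_lock:
                batch, self.pending = self.pending, []
            if batch:                         # may be empty if another combiner took our op
                self.ds.process_batch(batch)

ctr = BatchedCounter()
ib = ImplicitBatcher(ctr)
threads = [threading.Thread(target=lambda: [ib.apply("x") for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert ctr.counts["x"] == 4000   # every posted operation was processed exactly once
```

Each operation is drained from the buffer exactly once, so the wrapper behaves atomically even though the underlying structure only understands batches.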
Our third key contribution showcases a number of batched data structures implemented within the DMT model and designed using these building blocks. First, we design the batch-parallel 2-3 tree T, which supports certain batch operations with information-theoretically optimal work and span bounds, and which is unsurprisingly a key ingredient in many other parallel data structures. T can be used to easily construct other useful batch-parallel data structures, and in this seminar we shall discuss one of them: a simple parallel map M. Applying implicit batching to M yields an atomic map with good effective work and span bounds. We also present simple empirical experiments evaluating T on a real multi-core machine, which validate the theoretical performance bounds.
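The interface of a batch-parallel ordered structure like T can be illustrated with a sequential stand-in over a sorted list (class and method names here are illustrative, not the thesis's API; the real T is a 2-3 tree that processes each sorted batch in parallel with optimal work):

```python
import heapq
from bisect import bisect_left

class BatchedOrderedSet:
    """Sequential stand-in for a batch-parallel ordered structure:
    operations arrive as sorted batches rather than one at a time."""
    def __init__(self):
        self.keys = []   # kept sorted and duplicate-free

    def batch_search(self, queries):
        """Membership flags for a sorted batch of query keys."""
        out = []
        for q in queries:
            i = bisect_left(self.keys, q)
            out.append(i < len(self.keys) and self.keys[i] == q)
        return out

    def batch_insert(self, batch):
        """Merge a sorted batch of new keys into the key list."""
        merged = []
        for k in heapq.merge(self.keys, batch):
            if not merged or merged[-1] != k:   # drop duplicates
                merged.append(k)
        self.keys = merged

s = BatchedOrderedSet()
s.batch_insert([2, 3, 5, 7])
s.batch_insert([1, 3, 8])
assert s.keys == [1, 2, 3, 5, 7, 8]
assert s.batch_search([3, 4, 8]) == [True, False, True]
```

Batching pre-sorted operations is what makes the information-theoretically optimal work bounds achievable, since the structure can exploit the order of the whole batch at once.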
In the thesis we proceed to demonstrate more complex parallel data structures, namely a parallel working-set map WM and a parallel finger structure FS. Both are distribution-sensitive, meaning that when they are implicitly batched and hence appear atomic, the cost of each operation depends on the preceding sequence of operations according to some linearization. For each, we give not only a simpler batch-parallel version but also a more complicated pipelined version that yields a better 'span' term in the performance bounds. Roughly speaking, each of WM and FS is work-optimal with respect to the corresponding distribution-sensitive property (i.e. the working-set property or the finger property) when it has at least p items. The simpler versions WM1 and FS1 effectively take O((log p)^2 + log n) span per operation, whereas the faster versions WM2 and FS2 effectively take only O((log p)^2) span per operation more than optimal according to the respective distribution-sensitive property, on any sequence of searches, insertions, or deletions. In this seminar we shall only talk about FS.
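To make "distribution-sensitive" concrete, the following Python sketch computes a simplified single-finger cost model in which an access to a key at rank distance d from the previously accessed key costs about log2(d+2). This cost model is our illustrative assumption, not the thesis's exact definition; it shows why local access patterns are cheaper under the finger property:

```python
import math

def finger_bound(accesses):
    """Simplified finger-property cost of an access sequence over integer
    keys, with one finger resting at the last accessed key: each access
    costs log2(d + 2), where d is the distance from the finger.
    (Illustrative cost model only.)"""
    finger = accesses[0]
    total = 0.0
    for a in accesses:
        total += math.log2(abs(a - finger) + 2)
        finger = a
    return total

local = finger_bound(list(range(100)))   # sweep through neighboring keys
jumpy = finger_bound([0, 99] * 50)       # repeatedly jump across the key space
assert local < jumpy                     # local patterns cost far less per operation
```

A finger structure is work-optimal for such a bound: sequences that stay near the finger are served much faster than worst-case O(log n) per operation.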
To our knowledge, T, M, WM, FS are the first multithreaded data structures that meet the desired specifications and performance bounds even with asynchronous processes and a queued memory contention model.