Efficient OLAP for Big Data
17 Oct 2018 Wednesday, 02:30 PM to 04:00 PM
COM2 Level 4
Executive Classroom, COM2-04-02
OLAP faces plenty of challenges in the era of big data. The exploded volume of the immutable data hinders OLAP system achieving high service level agreement. Further, the traditional OLAP system can hardly support new analysis patterns like temporal dependence discovery in user behavioral data. In this dissertation, we present a general framework for emerging OLAP system in a bottom-up manner, starting from the basic indexing component and the core analysis engine to a high-level database system.
The motivation of the first work is due to the fact that coarse granularity of data accesses and the heavy use of latches, indices in the B-tree family are not efficient for in-memory databases, especially in the context of today's multi-core architecture. Consequently, we study the parallelizability of skip lists for the parallel and concurrent environment, and present PSL, a Parallel in-memory Skip List that lends itself naturally to the multi-core environment, particularly with non-uniform memory access.
Moreover, five state-of-the-art indices, including FAST, Masstree, BwTree, ART and PSL, are selected in a comprehensive evaluation study to explore the rationale behind the various design choices and implementation techniques. The performance study compares these indices from multiple perspectives, including query throughput, scalability, latency, memory consumption as well as cache/branch miss rate, using various query workloads with different characteristics. Our results indicate that there is no one-size-fits-all solution. For example, PSL achieves better query throughput for most settings, but occupies more memory space and can incur a large overhead in updating the index. Nevertheless, the huge performance gain renders the exploitation of modern hardware features indispensable for modern database indices.
In the second work, we present a new type of analysis pattern discovered in modern OLAP scenario. To analyze user behavior over time, it is useful to group users into cohorts, giving rise to cohort analysis. We identify several crucial limitations of current cohort analysis, motivated by the unmet need for temporal dependence discovery. To address these limitations, we propose a generalization that we call recurrent cohort analysis. We introduce a set of operators for recurrent cohort analysis and design access methods specific to these operators in both single-node and distributed environments. Through extensive experiments, we show that recurrent cohort analysis when implemented using the proposed access methods is up to six orders faster than one implemented as a layer on top of a database in a single-node setting, and two orders faster than one implemented using Spark SQL in a distributed setting. In future, a new system called Time-Dependent Data OLAP should be presented as the third work of this dissertation as the practical application of the previous works.