Efficient OLAP for big data
COM2 Level 2
MR3, COM2-02-26
closeAbstract:
OnLine Analytical Processing (OLAP) faces many challenges in the era of big data. The exploded volume of the immutable data hinders OLAP system achieving high service level agreement with traditional indexing mechanisms. Further, the traditional OLAP system can hardly support new analysis patterns like cohort analysis or its variant in user behavioral data. In this dissertation, starting from the data indexing in storage layer to query processing engine, we present a general framework for emerging OLAP system in a bottom-up manner.
First, the performance of the storage layer is significant for OLAP systems, where the key factor is the indexing structure employed. The fact of coarse granularity of data accesses and the heavy use of latches, makes most indices in the B-tree family not efficient nowadays, especially in the context of multi-core architecture under a huge amount of processed data. Consequently, we study the parallelizability of skip lists for the parallel and concurrent environment, and propose PSL, a Parallel in-memory Skip List that lends itself naturally to the multi-core environment, particularly with non-uniform memory access. Moreover, five state-of-the-art indices, including FAST, Masstree, BwTree, ART and PSL, are selected into a comprehensive evaluation study to explore the rationale behind the various design choices and implementation techniques. The performance study compares the candidates from multiple perspectives, including query throughput, scalability, latency, memory consumption as well as cache/branch miss rate, using various query workloads with different characteristics. Our results indicate that there is no one-size-fits-all solution and different indexing structures may fit in different scenarios.
Second, because of the increasing variety of data, new analysis methodologies are needed to process new queries in OLAP systems. For example, to analyze user behavior over time, it is useful to group users into cohorts, giving rise to cohort analysis, in multiple applications. During our usage on such queries, we identify several crucial limitations of current cohort analysis pattern, motivated by the unmet need for temporal dependence discovery. To address these limitations, we propose a generalization version, called recurrent cohort analysis. We introduce a set of operators for recurrent cohort analysis and design access methods specific to these operators in both single-node and distributed environments. Through extensive experiments, we show that recurrent cohort analysis when implemented using the proposed access methods is up to six orders faster than one implemented as a layer on top of a database in a single-node setting, and two orders faster than one implemented using SparkSQL in a distributed setting.
Meanwhile, we notice that it is hard to find an integrated OLAP system which can both support the emerging analysis patterns and traditional OLAP queries. Existing systems suffer from unsatisfied performance due to the large space consumed during query processing. Therefore, we present a cohort analysis supported OLAP system called COOL as the last work of the thesis. With a sophisticated storage design, the system can achieve fast query processing as well as low memory consumption. Moreover, the system can be faster with time dimension optimization and precomputation in OLAP queries. According to the experimental results, COOL beats Monetdb with one order of magnitude and takes only one third time to answer the queries compared to Druid. We further extend COOL into distributed settings and it performs better than SparkSQL based on the reported results in the complementary experiment.