Join Query Optimization with Improved Selectivity Estimation and Complete Join Reorderability

Abstract:
Query optimization in database systems is to find good execution plans for database queries. When a query involves more than one join operator for joining tables, a critical task of the query optimizer is to decide on an efficient order for the join executions. The optimization of multi-table join queries involves two key tasks. The first task (join reordering problem) is the enumeration of the search space of possible candidate join plans, and the second task (join estimation problem) is the estimation of the cost of each join plan. We study both problems in this thesis.
First, for the join estimation task, the key problem is to estimate the result cardinalities of table joins. There are two main approaches for join cardinality estimation, histogram-based approaches and sampling-based approaches. Compared to histogram-based approaches, sampling-based approaches perform better even with a large number of attributes and can support general join predicates. However, simply sampling each table independently does not provide good estimation accuracy as it is unable to capture the correlation between join attribute values. In this thesis, we look at a promising new technique called correlated sampling for estimating join result cardinalities which considers the join attribute correlation when doing sampling. However, existing correlated sampling approaches are restricted to simple scaling-up estimation method, and their estimation variance is high for small samples. In this work, we present a systematic study of correlated sampling techniques by first developing a framework to capture the key parameters of its design space. Based on this framework, we design a new family of correlated sampling techniques that uses a more advanced discrete learning method for estimating join cardinality. We also design different sampling strategies for different data characteristics. Our new approach is shown to outperform the state-of-the-art approach in terms of estimation quality, and is more robust to small samples and skewed data.
For the join reordering task, we look at reordering queries involving outerjoins and antijoins, which is significantly more challenging than reordering inner join queries as outerjoins and antijoins are in general not associative or commutative. Existing techniques allow for very limited reordering for outerjoin and antijoin queries. To address this limitation, in the second work, we formally define the notion of complete join reorderability, and develop a compensation-based approach that can completely reorder queries involving inner joins, one-sided outerjoins, and antijoins. The approach uses compensation operators to compensate for invalid reorderings, and gives the complete set of rewriting rules for reordering joins and compensations. We also introduce a new top-down plan enumeration algorithm with cost-based pruning and subplan reuse, which is non-trivial in the presence of compensation operators.
In the third work, to extend our join reordering approach to achieve complete reorderability also for queries involving full outerjoins, we introduce a new compensation operator which generalizes existing compensation operators, and derive the complete set of rules for reordering full outerjoins and other join operators, as well as the rules for reordering joins and the new compensation operator. We also extend the SQL-level implementation techniques of compensation operators to support full outerjoin queries, by redefining a concept called nullification sets and developing a sound algorithm for implementing the compensation. Experimental evaluation on well-known benchmarking datasets demonstrates that our join reordering approach can effectively improve the query performance with the enlarged query plan search space.