PH.D DEFENCE - PUBLIC SEMINAR

The Distance-Aggregation Query in Sub-linear Time

Speaker
Mr Lei Yifan
Advisor
Dr Anthony Tung Kum Hoe, Professor, School of Computing


23 Apr 2021 Friday, 09:30 AM to 11:00 AM

Zoom presentation

Abstract:

One of the central topics in the database and data mining fields is indexing the massive amounts of data that are constantly generated, so that queries can be answered efficiently using the built index. In this thesis, we study a very general class of such query problems, which we name the Distance Aggregation Query (DAQ) problem. Many special instances of DAQ, such as Nearest Neighbor Search (NNS) and Kernel Density Estimation (KDE), are extensively used in many applications. By generalizing these instances, we formalize the DAQ problem as reporting an aggregation of the distances between a given query and the data objects, under various distance functions and aggregation operators. To solve the DAQ problem, we propose a framework, Locality-Sensitive Hashing (LSH) based Distance Aggregation Search (LSH-DAS), which consists of two components: the LSH family and the search framework. LSH-DAS can handle many instances of DAQ with various distance functions and aggregation operators by plugging in proper LSH families and search frameworks. To further enable more distance functions and to improve performance under different aggregation operators, we develop novel techniques for both LSH families and search frameworks.
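As a schematic illustration (the generic symbols $f$ and $\bigoplus$ are chosen here for exposition and are not taken from the thesis), a DAQ over a dataset $D$ can be written as

    $\mathrm{DAQ}(q, D) = \bigoplus_{o \in D} f(\mathrm{dist}(q, o))$

where $\mathrm{dist}$ is a distance function, $f$ is a per-object transformation, and $\bigoplus$ is an aggregation operator. Choosing $\bigoplus = \min$ with $f$ the identity recovers NNS, while choosing $\bigoplus$ as a (normalized) sum with $f$ a kernel function recovers KDE.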

We first propose two algorithms for NNS over the weighted Euclidean distance, built on two new (asymmetric) LSH families. NNS over generalized weighted space is a fundamental problem with applications in many fields; however, to the best of our knowledge, there is no sublinear-time solution to this problem. Based on the idea of Asymmetric Locality-Sensitive Hashing (ALSH), we introduce a novel spherical asymmetric transformation and, accordingly, propose the first two weight-oblivious hashing schemes, SL-ALSH and S2-ALSH. We further show that both schemes enjoy a quality guarantee and can answer NNS queries in sublinear time. Evaluations over three real datasets demonstrate the superior performance of the two proposed schemes.
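To make the query concrete, the following is a minimal Python sketch of the weighted Euclidean NNS problem that SL-ALSH and S2-ALSH are designed to answer in sublinear time; it is only an illustration of the query itself (answered here by a linear scan), not of the thesis's hashing schemes, and all names are illustrative.

    import numpy as np

    def weighted_euclidean(x, q, w):
        """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - q_i)^2)."""
        diff = x - q
        return np.sqrt(np.sum(w * diff * diff))

    def brute_force_nns(data, q, w):
        """Linear scan: index of the data point closest to q under the
        query-specific weights w. The weight-oblivious schemes in the
        thesis build their index without knowing w in advance."""
        dists = [weighted_euclidean(x, q, w) for x in data]
        return int(np.argmin(dists))

    # toy usage
    data = np.random.rand(1000, 16)   # 1000 points in 16 dimensions
    q = np.random.rand(16)            # query point
    w = np.random.rand(16)            # per-dimension, query-specific weights
    print(brute_force_nns(data, q, w))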

Then, we propose a novel LSH scheme for NNS based on the Longest Circular Co-Substring (LCCS) search framework (LCCS-LSH). We introduce the novel concept of LCCS and a new data structure named Circular Shift Array (CSA) for $k$-LCCS search. The key insight of the LCCS search framework is that, with high probability, a data object close to the query in the original space has a longer LCCS with the query than far-apart objects do. When applied with different LSH families, the LCCS-LSH scheme supports $c$-ANNS under different distance metrics, and it enjoys a quality guarantee on query results. We also introduce a multi-probe version of LCCS-LSH and conduct extensive experiments over five real-life datasets. The experimental results demonstrate that LCCS-LSH outperforms state-of-the-art LSH schemes.
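A minimal brute-force sketch of the LCCS length is given below, assuming (as an illustration, not the thesis's formal definition) that the LCCS of two equal-length hash-code sequences is the longest run of consecutive positions, read circularly from a common starting index, at which the two sequences agree; the CSA data structure in the thesis supports this kind of search far more efficiently than this quadratic scan.

    def lccs_length(x, y):
        """Brute-force length of the longest circular co-substring of two
        equal-length hash-code sequences x and y: the longest run of
        consecutive positions (wrapping around, starting at the same
        index in both sequences) at which x and y agree."""
        assert len(x) == len(y)
        m = len(x)
        best = 0
        for start in range(m):
            run = 0
            while run < m and x[(start + run) % m] == y[(start + run) % m]:
                run += 1
            best = max(best, run)
        return best

    # toy usage: two sequences of LSH hash values
    print(lccs_length([3, 7, 1, 4, 4, 2], [9, 7, 1, 4, 0, 2]))  # -> 3 (positions 1..3)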

Next, we propose an improvement over the existing LSH search framework for KDE and show a real-life application of it to obstacle detection. We start by studying a new data mining problem: detecting obstacles from trajectory data. While formalizing the definition of obstacle detection, we find that Gaussian KDE can be used to estimate the density of sub-trajectories. Therefore, to handle the KDE problem and to detect obstacles properly, we propose a novel framework to Detect Implicit Obstacles from Trajectories (DIOT), which utilizes a novel approximation to the Gaussian kernel for density estimation and a depth-first retrieval method for obstacle detection. Extensive experiments over two real-life datasets show that DIOT captures the nature of obstacles and detects implicit obstacles effectively and efficiently.
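For reference, the standard Gaussian KDE of a query point $q$ over a dataset $D = \{x_1, \dots, x_n\}$ with bandwidth $h$ (up to a normalizing constant, and not specific to the thesis's approximation) is

    $\hat{f}(q) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\left(-\frac{\|q - x_i\|^2}{2h^2}\right)$

which is exactly a DAQ instance with the Euclidean distance, the Gaussian kernel as the per-object transformation, and a sum as the aggregation operator.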

Finally, we conclude this thesis and outline several possible directions for future work on extending LSH-DAS to more distance functions and aggregation operators.